Georg Lange

Independent

2026, Berlin, AI for Security

Agentic Mechanistic Interpretability: Dictionary Learning and Automated Circuit Tracing

I’m developing AI agents that accelerate mechanistic interpretability by doing circuit tracing the way a human researcher would: forming hypotheses about features and circuits in LLMs, testing them with causal interventions, and iterating toward faithful, readable explanations. The core technical goal is to use these agents to improve dictionary learning and feature representations so that circuit tracing works end-to-end on real behaviors we care about. The longer-term motivation is to make it practical to trace the computations behind high-stakes behaviors, so we can audit for deception, hidden intent, and other misalignment signals in a way that goes beyond black-box evaluation.

Biography

I’m an independent researcher working on mechanistic interpretability for LLMs. My focus is circuit tracing, dictionary learning, and automated interpretability, with the aim of building tools that are broadly useful for detecting deception, hidden intent, and misalignment.