Fazl Barez
University of Oxford
2026, Berlin, AI for Security
An Automated Interpretability-Driven System for Model Auditing and Control
We will build an agentic interpretability system that lets experts converse with an agent to check which internal computations led to a model’s answer, and that fixes model errors with minimal side effects. This will allow non-AI experts in safety-critical domains to contribute to AI alignment through their own areas of expertise.
Biography
How do we make AI systems that we can actually inspect, correct, and trust? Fazl Barez is a Principal Investigator at the University of Oxford leading research on technical AI safety, interpretability, and governance. He teaches Oxford’s AI Safety and Alignment course, has consulted for Anthropic, and has held fellowships at the RAND Corporation. His work bridges mechanistic interpretability, model auditing, and policy, translating what is found inside models into tools that regulators and auditors can use.