Recording

Interpretability for Safety and Alignment @ Intelligent Cooperation Workshop

With Fazl Barez


Fazl Barez presented some of his recent work, walking through a mechanistic interpretability “recipe”. He discussed interpreting language model neurons at scale, focusing on the interaction between attention heads and neurons and on neuroplasticity. After explaining what it means to formulate the problem, he closed by discussing how humans can be trained to predict model behaviour.
