Recording
Interpretability for Safety and Alignment @ Intelligent Cooperation Workshop
With Fazl Barez
Fazl Barez presented some of his recent work, walking through a mechanistic interpretability “recipe”. He discussed interpreting language model neurons at scale, focusing on interactions between attention heads and neurons and on neuroplasticity. After explaining what it means to formulate the problem, he closed by discussing training humans to predict model behaviour.