
Recording

Interpretability for Safety and Alignment @ Intelligent Cooperation Workshop

With Fazl Barez

Date

21 September 2023

Fazl Barez presented some of his recent work, walking through a mechanistic interpretability “recipe”. He discussed interpreting language model neurons at scale, focusing on the interaction between attention heads and neurons and on neuroplasticity. After explaining what it means to formulate the problem, he closed by discussing training humans to predict model behaviour.

