C. Schroeder de Witt | Secret Collusion Among Generative AI Agents: Toward Multi-Agent Security
With Christian Schroeder de Witt
I argue that the ability of advanced AI agents to use perfect stealth will soon be AI Safety's biggest concern. Importantly, information-theoretic undetectability undermines our current approaches to multi-agent security. In this talk, I will focus on steganographic collusion among generative AI agents. I will give an introduction to steganography, discuss how agents could be incentivised to use it in unintended ways, and present a novel model evaluation framework as well as mitigation measures. How do we design autonomous systems and environments in which undetectable actions cannot cause unacceptable damage?