Constitutional Classifiers: Defending against universal jailbreaks

From Simon Willison's Weblog
Interesting new research from Anthropic, resulting in the paper Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming. From the paper:

In particular, we introduce Constitutional Classifiers, a framework that trains classifier...