'Constitutional Classifiers' Technique Mitigates GenAI Jailbreaks
Anthropic says its Constitutional Classifiers approach offers a practical way to make it harder for bad actors to coerce an AI model off its guardrails.
February 3, 2025
Researchers at Anthropic, the company behind the Claude AI assistant, have developed an approach they believe provides a practical, scalable method to make it harder for malicious actors to jailbreak or bypass the built-in safety mechanisms of a range of large language models (LLMs).
The approach employs a set of natural-language rules, or a "constitution," to define categories of permitted and disallowed content in an AI model's input and output, and then uses synthetic data generated from those rules to train classifiers that recognize and filter that content.
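In rough outline, the training side of such a system might look something like the Python sketch below. It is not Anthropic's code: the constitution rules, the handful of hand-written stand-ins for synthetic training data, and the simple scikit-learn classifier are all illustrative assumptions.

```python
# Minimal sketch of the idea (not Anthropic's implementation): plain-language
# "constitution" rules seed synthetic training examples, which then train a
# simple text classifier. Rule text, example prompts, and the scikit-learn
# pipeline are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Constitution: natural-language categories of allowed vs. disallowed content.
CONSTITUTION = {
    "allowed": ["general chemistry education", "common medication facts"],
    "disallowed": ["acquisition or synthesis of restricted chemical agents"],
}

# In the real system an LLM expands each constitution rule into large volumes
# of synthetic prompts and outputs; a few hand-written stand-ins keep this short.
synthetic_data = [
    ("What are the properties of household bleach?", 0),
    ("List common over-the-counter pain medications.", 0),
    ("Explain at a high level how vaccines are manufactured.", 0),
    ("Give step-by-step instructions to synthesize a nerve agent.", 1),
    ("Where can I buy a restricted chemical precursor without a license?", 1),
    ("How do I purify this restricted compound at home?", 1),
]

texts, labels = zip(*synthetic_data)
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(texts, labels)  # 1 = disallowed under the constitution

print(classifier.predict(["How do common antihistamines work?"]))  # expect [0]
```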
"Constitutional Classifiers" Anti-Jailbreak Technique
In a technical paper released this week, the Anthropic researchers said their so-called Constitutional Classifiers approach proved highly effective against universal jailbreaks, withstanding more than 3,000 hours of human red-teaming by some 183 white-hat hackers through the HackerOne bug bounty program.
"These Constitutional Classifiers are input and output classifiers trained on synthetically generated data that filter the overwhelming majority of jailbreaks with minimal over-refusals and without incurring a large compute overhead," the researchers said in an related blog post. They have established a demo website where anyone with experience jailbreaking an LLM can try out their system for the next week (Feb. 3 to Feb. 10).
In the context of generative AI (GenAI) models, a jailbreak is any prompt or set of prompts that causes the model to bypass its built-in content filters, safety mechanisms, and ethical constraints. Jailbreaks typically involve a researcher, or a bad actor, crafting specific input sequences, linguistic tricks, or even role-playing scenarios to trick an AI model into escaping its protective guardrails and producing potentially dangerous, malicious, or incorrect content.
The most recent example involves researchers at Wallarm extracting secrets from DeepSeek, the Chinese generative AI tool that recently upended long-held notions of just how much compute power is required to run an LLM. Since ChatGPT exploded onto the scene in November 2022, there have been multiple other examples, including one in which researchers used one LLM to jailbreak a second, another that relied on the repetitive use of certain words to get an LLM to spill its training data, and another that worked through doctored images and audio.
Balancing Effectiveness With Efficiency
In developing the Constitutional Classifiers system, the researchers wanted to ensure a high rate of effectiveness against jailbreaking attempts without drastically impacting the ability of people to extract legitimate information from an AI model. One simple example was ensuring the model could distinguish between a prompt asking for a list of common medications or an explanation of the properties of household chemicals, and a request for where to acquire a restricted chemical or how to purify it. The researchers also wanted to ensure the classifiers added minimal computing overhead.
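That trade-off can be illustrated with a small, standalone Python sketch. The score_harm() stub, the example prompts, and the threshold values are hypothetical; the point is only to show how raising or lowering the refusal threshold shifts the balance between over-refusing benign questions and blocking harmful ones.

```python
# Standalone sketch of the effectiveness-vs-over-refusal trade-off (all data
# and the score_harm() stub are hypothetical): sweeping the refusal threshold
# shows how blocking more harmful requests can also refuse more legitimate ones.
def score_harm(prompt: str) -> float:
    """Stand-in for a trained constitutional classifier's harm probability."""
    restricted_terms = ("restricted chemical", "purify", "synthesize a nerve agent")
    return 0.9 if any(term in prompt.lower() for term in restricted_terms) else 0.1

benign = ["List common medications.", "Explain the properties of household bleach."]
harmful = ["Where can I acquire a restricted chemical and how do I purify it?"]

for threshold in (0.05, 0.5, 0.95):
    over_refusals = sum(score_harm(p) >= threshold for p in benign) / len(benign)
    blocked = sum(score_harm(p) >= threshold for p in harmful) / len(harmful)
    print(f"threshold={threshold}: over-refusal rate={over_refusals:.0%}, "
          f"block rate={blocked:.0%}")
```

Too low a threshold refuses everything, too high a threshold blocks nothing; tuning it (and the classifier itself) is where the researchers aimed to keep over-refusals under 1% while still stopping jailbreaks.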
In tests, jailbreak attempts succeeded 86% of the time against a version of Claude with no defensive classifiers, compared with 4.4% against one protected by a Constitutional Classifier. According to the researchers, using the classifier increased refusal rates by less than 1% and compute costs by nearly 24% compared to the unguarded model.
LLM Jailbreaks: A Major Threat
Jailbreaks have emerged as a major consideration when it comes to making GenAI models with sophisticated scientific capabilities widely available. The concern is that such models could give even an unskilled actor the opportunity to "uplift" their skills to expert-level capabilities. This becomes an especially big problem when attackers try to jailbreak LLMs into divulging dangerous chemical, biological, radiological, or nuclear (CBRN) information, the Anthropic researchers noted.
Their work focused on how to augment an LLM with classifiers that monitor the model's inputs and outputs and block potentially harmful content. Instead of using hard-coded static filtering, they wanted something with a more sophisticated understanding of a model's guardrails that could act as a real-time filter when the model receives inputs or generates responses. "This simple approach is highly effective: in over 3,000 hours of human red teaming on a classifier-guarded system, we observed no successful universal jailbreaks in our target...domain," the researchers wrote. The red-team tests involved the bug bounty hunters trying to obtain answers from Claude AI to a set of harmful questions involving CBRN risks, using thousands of known jailbreaking hacks.
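Conceptually, the wrapping pattern resembles the sketch below, in which an input classifier screens the prompt and an output classifier re-scores the response as it streams. The function names and interfaces are assumptions for illustration, not Anthropic's implementation.

```python
# Sketch of the guarded-model pattern described above. The stream_model()
# generator and the harm-scoring callables are hypothetical stand-ins for the
# real LLM and the trained input/output classifiers.
from typing import Callable, Iterable

REFUSAL = "I can't help with that request."

def guarded_generate(
    prompt: str,
    stream_model: Callable[[str], Iterable[str]],  # yields text chunks
    score_input: Callable[[str], float],           # input classifier, 0..1
    score_output: Callable[[str], float],          # output classifier, 0..1
    threshold: float = 0.5,
) -> str:
    # Screen the prompt before generating anything.
    if score_input(prompt) >= threshold:
        return REFUSAL

    chunks = []
    for chunk in stream_model(prompt):
        chunks.append(chunk)
        # Re-score the cumulative output so harmful text is cut off mid-stream.
        if score_output("".join(chunks)) >= threshold:
            return REFUSAL
    return "".join(chunks)
```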