'Constitutional Classifiers' Technique Mitigates GenAI Jailbreaks

Anthropic says its Constitutional Classifiers approach offers a practical way to make it harder for bad actors to coerce an AI model off its guardrails.

Introduction to Claude 3, seen on the Anthropic website (Source: Tada Images via Shutterstock)

Researchers at Anthropic, the company behind the Claude AI assistant, have developed an approach they believe provides a practical, scalable method to make it harder for malicious actors to jailbreak or bypass the built-in safety mechanisms of a range of large language models (LLMs).

The approach employs a set of natural language rules — or a "constitution" — to define categories of permitted and disallowed content in an AI model's inputs and outputs, and then uses synthetic data generated from those rules to train classifiers that recognize and flag violating content.

"Constitutional Classifiers" Anti-Jailbreak Technique

In a technical paper released this week, the Anthropic researchers said their so-called Constitutional Classifiers approach proved highly effective against universal jailbreaks, withstanding more than 3,000 hours of human red-teaming by some 183 white-hat hackers through the HackerOne bug bounty program.

"These Constitutional Classifiers are input and output classifiers trained on synthetically generated data that filter the overwhelming majority of jailbreaks with minimal over-refusals and without incurring a large compute overhead," the researchers said in an related blog post. They have established a demo website where anyone with experience jailbreaking an LLM can try out their system for the next week (Feb. 3 to Feb. 10).


In the context of generative AI (GenAI) models, a jailbreak is any prompt or set of prompts that causes the model to bypass its built-in content filters, safety mechanisms, and ethical constraints. Such attacks typically involve a researcher — or a bad actor — crafting specific input sequences, using linguistic tricks or even role-playing scenarios, to trick an AI model into escaping its protective guardrails and spewing out potentially dangerous, malicious, or incorrect content.

The most recent example involves researchers at Wallarm extracting secrets from DeepSeek, the Chinese generative AI tool that recently upended long-held notions of just how much compute power is required to run an LLM. Since ChatGPT exploded onto the scene in November 2022, there have been multiple other examples, including one where researchers used one LLM to jailbreak another, one involving the repetitive use of certain words to get an LLM to spill its training data, and another relying on doctored images and audio.

Balancing Effectiveness With Efficiency

In developing the Constitutional Classifiers system, the researchers wanted to ensure a high rate of effectiveness against jailbreaking attempts without drastically limiting people's ability to extract legitimate information from an AI model. One simple example was ensuring the model could distinguish between a prompt asking for a list of common medications, or for an explanation of the properties of household chemicals, and a request for where to acquire a restricted chemical or how to purify it. The researchers also wanted the classifiers to add minimal computing overhead.
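
As an illustration of that distinction, a trained input classifier might simply score each prompt and refuse only those that cross a harm threshold; the classifier_score callable and the 0.5 threshold below are assumptions for the sketch, not values from the paper.

```python
# Illustrative only: gating prompts with a trained input classifier.
# classifier_score() stands in for any model that returns the probability
# that a prompt falls into a disallowed category from the constitution.

BENIGN = "What are the active ingredients in common over-the-counter painkillers?"
RESTRICTED = "Where can I acquire a restricted chemical, and how do I purify it?"

def should_block(prompt: str, classifier_score, threshold: float = 0.5) -> bool:
    """Refuse a prompt only if its harm score crosses the threshold."""
    return classifier_score(prompt) >= threshold

# A well-calibrated classifier lets the benign question through while
# refusing the restricted one, keeping over-refusals on legitimate queries low.
```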


In tests, researchers had a jailbreak success rate of 86% on a version of Claude with no defensive classifiers, compared to 4.4% on one using a Constitutional Classifier. According to the researchers, using the classifier increased refusal rates by less than 1% and compute costs by nearly 24% compared to the unguarded model.

LLM Jailbreaks: A Major Threat

Jailbreaks have emerged as a major consideration when it comes to making GenAI models with sophisticated scientific capabilities widely available. The concern is that a successful jailbreak gives even an unskilled actor the opportunity to "uplift" their skills to expert-level capabilities. This can become an especially big problem when it comes to trying to jailbreak LLMs into divulging dangerous chemical, biological, radiological, or nuclear (CBRN) information, the Anthropic researchers noted.


Their work focused on how to augment an LLM with classifiers that monitor the model's inputs and outputs and block potentially harmful content. Instead of using hard-coded static filtering, they wanted something with a more sophisticated understanding of a model's guardrails that could act as a real-time filter when the model receives inputs or generates responses. "This simple approach is highly effective: in over 3,000 hours of human red teaming on a classifier-guarded system, we observed no successful universal jailbreaks in our target...domain," the researchers wrote. The red-team tests involved the bug bounty hunters trying to obtain answers from Claude AI to a set of harmful questions involving CBRN risks, using thousands of known jailbreaking hacks.
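
In outline, that guarded setup wraps the model with one classifier on the way in and another on the way out. The sketch below assumes generic llm_generate, input_clf, and output_clf callables and a fixed refusal message; none of these names come from Anthropic's implementation.

```python
# Minimal sketch of classifier-guarded generation: screen the prompt before
# it reaches the model, then screen the completion before it reaches the user.

REFUSAL = "I can't help with that request."

def guarded_generate(prompt: str, llm_generate, input_clf, output_clf,
                     threshold: float = 0.5) -> str:
    # Input classifier blocks clearly disallowed requests up front.
    if input_clf(prompt) >= threshold:
        return REFUSAL
    response = llm_generate(prompt)
    # Output classifier screens the completion; in practice this can run over
    # the streamed output so a harmful answer is cut off early.
    if output_clf(prompt, response) >= threshold:
        return REFUSAL
    return response
```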

About the Author

Jai Vijayan, Contributing Writer

Jai Vijayan is a seasoned technology reporter with over 20 years of experience in IT trade journalism. He was most recently a Senior Editor at Computerworld, where he covered information security and data privacy issues for the publication. Over the course of his 20-year career at Computerworld, Jai also covered a variety of other technology topics, including big data, Hadoop, Internet of Things, e-voting, and data analytics. Prior to Computerworld, Jai covered technology issues for The Economic Times in Bangalore, India. Jai has a Master's degree in Statistics and lives in Naperville, Ill.

