Assume Breach When Building AI Apps
AI jailbreaks are not vulnerabilities; they are expected behavior.
COMMENTARY
If you are still a skeptic about artificial intelligence (AI), you won't be for long. I was recently using Claude.ai to model security data I had at hand into a graph for attack path analysis. While I could have done this myself, Claude took care of the task in minutes. More importantly, Claude was just as quick to adapt the script when the initial requirements changed significantly. Instead of switching back and forth between being a security researcher and a data engineer — exploring the graph, identifying a missing property or relation, and adapting the script — I could keep my researcher hat on while Claude played the engineer.
These are moments of clarity, when you realize your toolbox has been upgraded, saving you hours or days of work. Many people seem to be having those moments, and they are becoming more convinced of the impact AI is going to have in the enterprise.
But AI isn't infallible. There have been a number of public examples of AI jailbreaking, where a generative AI model is fed carefully crafted prompts that make it do or say unintended things. That can mean bypassing built-in safety features and guardrails, or accessing capabilities that are supposed to be restricted. AI companies are trying to solve jailbreaking; some say they have either done so or are making significant progress. Jailbreaking is treated as a fixable problem — a quirk we'll soon get rid of.
As part of that mindset, AI vendors are treating jailbreaks as vulnerabilities. They expect researchers to submit their latest prompts to a bug-bounty program instead of publishing them on social media for laughs. Some security leaders are talking about AI jailbreaks in terms of responsible disclosure, creating a clear contrast with those supposedly irresponsible people who disclose jailbreaks publicly.
Reality Sees Things Differently
Meanwhile, AI jailbreaking communities are popping up on social media and community platforms, such as Discord and Reddit, like mushrooms after the rain. These communities are more akin to gaming speedrunners than to security researchers. Whenever a new generative AI model is released, they race to see who can find a jailbreak first. It usually takes minutes, and they never fail. These communities do not know about, or care about, responsible disclosure.
To quote an X post from Pliny the Prompter, a popular social media account from the AI breaking community: "circumventing AI 'safety' measures is getting easier as they become more powerful, not harder. this may seem counterintuitive but it's all about the surface area of attack, which seems to be expanding much faster than anyone on defense can keep up with."
Let's imagine for a second that vulnerability disclosure could work — that we could get every person on the planet to submit their evil prompts to a National Vulnerability Database-style repository before sharing them with their friends. Would that actually help? Last year at DEF CON, the AI Village hosted the largest public AI red-teaming event, where it reportedly collected over 17,000 jailbreaking conversations. This was an incredible effort with huge benefits to our understanding of securing AI, but it did not make any significant change to the rate at which AI jailbreaks are discovered.
Vulnerabilities are quirks of the application in which they are found; the more complex the application, the more surface area there is for vulnerabilities. AI captures human language remarkably well, but can we really hope to enumerate every quirk of the human experience?
Stop Worrying About Jailbreaks
We need to operate under the assumption that AI jailbreaks are trivial. Don't give your AI application capabilities it should not be using. If the application can perform sensitive actions and its only defense is that users don't know the right prompts, expect a persistent user to eventually trigger those actions.
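To make that concrete, here is a minimal sketch in Python of what least privilege can look like in an AI application. The tool names, scopes, and the `execute_tool` helper are my own illustrative assumptions, not any particular framework's API; the point is that the allowlist lives in application code, where no cleverly worded prompt can expand it.

```python
# Minimal sketch of capability gating for an AI agent (illustrative only).
# The allowlist is enforced outside the model, so a jailbroken prompt cannot
# grant capabilities the application never exposed in the first place.

ALLOWED_TOOLS = {
    # tool name       -> actions the application is willing to permit
    "search_tickets": {"read"},
    "summarize_doc":  {"read"},
    # deliberately absent: "delete_record", "send_email", "run_raw_query"
}

def execute_tool(name: str, action: str, payload: dict) -> str:
    """Run a tool call requested by the model, enforcing the allowlist first."""
    scopes = ALLOWED_TOOLS.get(name)
    if scopes is None:
        raise PermissionError(f"tool '{name}' is not exposed to the agent")
    if action not in scopes:
        raise PermissionError(f"tool '{name}' does not permit '{action}'")
    # ... dispatch to the real tool implementation here ...
    return f"executed {name}:{action}"

# Example: the gate, not the model, decides what is possible.
print(execute_tool("search_tickets", "read", {"query": "open incidents"}))
```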
AI startups are suggesting we think of AI agents as employees who know a lot of facts but need guidance on applying their knowledge to the real world. As security professionals, we need a different analogy: Think of an AI agent as an expert you want to hire, even though that expert defrauded their previous employer. You really need this employee, so you put a bunch of guardrails in place to ensure they won't defraud you as well. But at the end of the day, every piece of data and every bit of access you give this problematic employee exposes your organization to risk. Instead of trying to build systems that can't be jailbroken, let's focus on applications that are easy to monitor, so that when they are inevitably jailbroken, we can quickly respond and limit the impact.
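Here is an equally minimal Python sketch of that monitoring-first mindset, again using assumed names rather than any specific product's API: every action the agent takes on a user's behalf is wrapped in a structured audit record that captures the triggering prompt, so that when a jailbreak lands, responders can reconstruct exactly what the session did.

```python
# Minimal sketch of audit logging around agent actions (illustrative only).
# The goal is not to prevent jailbreaks but to make them visible and traceable.

import json
import logging
from datetime import datetime, timezone
from typing import Callable

logging.basicConfig(level=logging.INFO)
audit = logging.getLogger("agent.audit")

def audited_call(session_id: str, user_prompt: str, tool_name: str,
                 tool_fn: Callable[[dict], str], payload: dict) -> str:
    """Run a tool call on behalf of the model and emit a structured audit record."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "session": session_id,
        "prompt": user_prompt,            # what the user actually asked for
        "tool": tool_name,
        "payload_keys": sorted(payload),  # log the shape of the call, not secrets
    }
    try:
        result = tool_fn(payload)
        record["outcome"] = "ok"
        return result
    except Exception as err:
        record["outcome"] = f"error: {err}"
        raise
    finally:
        audit.info(json.dumps(record))    # one JSON line per action, easy to alert on

# Hypothetical usage: wrap every tool the agent is allowed to reach.
audited_call("sess-42", "summarize ticket T-1001", "summarize_doc",
             lambda p: f"summary of {p['ticket_id']}", {"ticket_id": "T-1001"})
```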