'Bad Likert Judge' Jailbreak Bypasses Guardrails of OpenAI, Other Top LLMs

A novel technique for manipulating text-based artificial intelligence (AI) systems increases the likelihood of a successful cyberattack by more than 60%.


A new jailbreak technique for OpenAI's and other large language models (LLMs) increases the chance that attackers can circumvent cybersecurity guardrails and abuse the systems to deliver malicious content.

Discovered by researchers at Palo Alto Networks' Unit 42, the so-called Bad Likert Judge attack asks the LLM to act as a judge scoring the harmfulness of a given response using the Likert scale. The psychometric scale, named after its inventor and commonly used in questionnaires, is a rating scale measuring a respondent's agreement or disagreement with a statement.

The jailbreak then asks the LLM to generate responses containing examples that align with each point on the scale, with the ultimate result being that "the example that has the highest Likert scale can potentially contain the harmful content," Unit 42's Yongzhe Huang, Yang Ji, Wenjun Hu, Jay Chen, Akshata Rao, and Danny Tsechansky wrote in a post describing their findings.

Tests conducted across a range of categories against six state-of-the-art text-generation LLMs from OpenAI, Azure, Google, Amazon Web Services, Meta, and Nvidia revealed that the technique can increase the attack success rate (ASR) by more than 60% on average compared with plain attack prompts, according to the researchers.

The categories of attacks evaluated in the research involved prompting various inappropriate responses from the system, including ones that promote bigotry, hate, or prejudice; harass an individual or group; encourage suicide or other acts of self-harm; generate sexually explicit material and pornography; provide information on how to manufacture, acquire, or use illegal weapons; or promote other illegal activities.

Other categories explored, for which the jailbreak also increases the likelihood of attack success, include malware generation (the creation and distribution of malicious software) and system prompt leakage, which could reveal the confidential set of instructions used to guide the LLM.

How Bad Likert Judge Works

The first step in the Bad Likert Judge attack involves asking the target LLM to act as a judge to evaluate responses generated by other LLMs, the researchers explained.

"To confirm that the LLM can produce harmful content, we provide specific guidelines for the scoring task," they wrote. "For example, one could provide guidelines asking the LLM to evaluate content that may contain information on generating malware."

Once the first step is properly completed, the LLM should understand the task and the different scales of harmful content, which makes the second step "straightforward," they said. "Simply ask the LLM to provide different responses corresponding to the various scales," the researchers wrote.

"After completing step two, the LLM typically generates content that is considered harmful," they wrote, adding that in some cases, "the generated content may not be sufficient to reach the intended harmfulness score for the experiment."

To address the latter issue, an attacker can ask the LLM to refine the response with the highest score by extending it or adding more details. "Based on our observations, an additional one or two rounds of follow-up prompts requesting refinement often lead the LLM to produce content containing more harmful information," the researchers wrote.

Rise of LLM Jailbreaks

The exploding use of LLMs for personal, research, and business purposes has led researchers to test their susceptibility to generating harmful and biased content when prompted in specific ways. "Jailbreak" is the term for methods that bypass the guardrails LLM creators put in place to prevent the generation of such content.

Security researchers have already identified several types of jailbreaks, according to Unit 42. They include one called persona persuasion; a role-playing jailbreak dubbed Do Anything Now; and token smuggling, which uses encoded words in an attacker's input.

Researchers at Robust Intelligence and Yale University also recently discovered a jailbreak called Tree of Attacks with Pruning (TAP), which involves using an unaligned LLM to "jailbreak" another aligned LLM, or to get it to breach its guardrails, quickly and with a high success rate.

Unit 42 researchers stressed that their jailbreak technique "targets edge cases and does not necessarily reflect typical LLM use cases." This means that "most AI models are safe and secure when operated responsibly and with caution," they wrote.

How to Mitigate LLM Jailbreaks

However, no LLM is completely secure from jailbreaks, the researchers cautioned. The reason jailbreaks can undermine the security that OpenAI, Microsoft, Google, and others build into their LLMs is mainly the computational limits of language models, they said.

"Some prompts require the model to perform computationally intensive tasks, such as generating long-form content or engaging in complex reasoning," they wrote. "These tasks can strain the model's resources, potentially causing it to overlook or bypass certain safety guardrails."

Attackers also can manipulate the model's understanding of the conversation's context by "strategically crafting a series of prompts" that "gradually steer it toward generating unsafe or inappropriate responses that the model's safety guardrails would otherwise prevent," they wrote.

To mitigate the risks from jailbreaks, the researchers recommend applying content-filtering systems alongside LLMs. These systems run classification models on both the prompt and the model's output to detect potentially harmful content.
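To illustrate the pattern the researchers describe, the minimal Python sketch below wraps an LLM call with a classifier pass on both the incoming prompt and the generated response. It is not Unit 42's setup: the OpenAI Moderation API stands in here for whichever harmful-content classifier a deployment actually uses, and the model names and helper functions (is_flagged, guarded_completion) are illustrative assumptions.

```python
# A minimal sketch of input/output content filtering around an LLM call.
# Assumes the openai Python SDK (v1.x) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()


def is_flagged(text: str) -> bool:
    """Run the content classifier and return True if the text is flagged as harmful."""
    result = client.moderations.create(
        model="omni-moderation-latest",
        input=text,
    )
    return result.results[0].flagged


def guarded_completion(prompt: str) -> str:
    # Filter the prompt before it ever reaches the model.
    if is_flagged(prompt):
        return "Request blocked by input content filter."

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    answer = response.choices[0].message.content or ""

    # Filter the model's output before returning it to the user; this second
    # pass is what catches jailbreaks that slip past the input check.
    if is_flagged(answer):
        return "Response blocked by output content filter."
    return answer


if __name__ == "__main__":
    print(guarded_completion("Summarize what a Likert scale is in two sentences."))
```

The key design choice is filtering the output as well as the input, since multi-step jailbreaks like Bad Likert Judge use prompts that may look benign on their own.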

"The results show that content filters can reduce the ASR by an average of 89.2 percentage points across all tested models," the researchers wrote. "This indicates the critical role of implementing comprehensive content filtering as a best practice when deploying LLMs in real-world applications."

About the Author

Elizabeth Montalbano, Contributing Writer

Elizabeth Montalbano is a freelance writer, journalist, and therapeutic writing mentor with more than 25 years of professional experience. Her areas of expertise include technology, business, and culture. Elizabeth previously lived and worked as a full-time journalist in Phoenix, San Francisco, and New York City; she currently resides in a village on the southwest coast of Portugal. In her free time, she enjoys surfing, hiking with her dogs, traveling, playing music, yoga, and cooking.
