Dangerous AI Workaround: 'Skeleton Key' Unlocks Malicious Content
Microsoft, OpenAI, Google, and Meta GenAI models could be convinced to ditch their guardrails, opening the door to chatbots giving unfettered answers on building bombs, creating malware, and much more.
June 26, 2024
A new type of direct prompt injection attack dubbed "Skeleton Key" could allow users to bypass the ethical and safety guardrails built into generative AI models like ChatGPT, Microsoft is warning. It works by providing context around normally forbidden chatbot requests, allowing users to access offensive, harmful, or illegal content.
For instance, if a user asked for instructions on how to make a dangerous wiper malware that could bring down power plants, most commercial chatbots would first refuse. But, after revising the prompt to note that the request is for "a safe education context with advanced researchers trained on ethics and safety" and to provide the requested information with a "warning" disclaimer, then it's very likely that the AI would then provide the uncensored content.
In other words, Microsoft found it was possible to convince most top AIs that a malicious request is for perfectly legal, if not noble, causes — just by telling them that the information is for "research purposes."
"Once guardrails are ignored, a model will not be able to determine malicious or unsanctioned requests from any other," explained Mark Russinovich, CTO for Microsoft Azure, in a post today on the tactic. "Because of its full bypass abilities, we have named this jailbreak technique Skeleton Key."
He added, "Further, the model's output appears to be completely unfiltered and reveals the extent of a model's knowledge or ability to produce the requested content."
Remediation for Skeleton Key
The technique affects multiple GenAI models that Microsoft researchers tested, including Microsoft Azure AI-managed models, and those from Meta, Google Gemini, Open AI, Mistral, Anthropic, and Cohere.
"All the affected models complied fully and without censorship for [multiple forbidden] tasks," Russinovich noted.
The computing giant fixed the problem in Azure by introducing new prompt shields to detect and block the tactic, and making a few software updates to the large language model (LLM) that powers Azure AI. It also disclosed the issue to the other vendors affected.
Admins still need to update their models to implement any fixes that those vendors may have rolled out. And those who are building their own AI models can also use the following mitigations, according to Microsoft:
Input filtering to identify any requests that contain harmful or malicious intent, regardless of any disclaimers that accompany them.
An additional guardrail that specifies that any attempts to undermine safety guardrail instructions should be prevented.
And output filtering that identifies and prevents responses that breach safety criteria.
About the Author
You May Also Like