LLMs Open to Manipulation Using Doctored Images, Audio
As LLMs begin to integrate multimodal capabilities, attackers could use hidden instructions in images and audio to get a chatbot to respond the way they want, say researchers at Black Hat Europe 2023.
December 5, 2023
Attackers could soon begin using malicious instructions hidden in strategically placed images and audio clips online to manipulate the responses that the large language models (LLMs) behind AI chatbots such as ChatGPT give to user prompts.
Adversaries could use these so-called "indirect prompt injection" attacks to redirect users to malicious URLs, extract personal information from users, deliver payloads, and take other malicious actions. Such attacks could become a major issue as LLMs become increasingly multimodal, capable of responding contextually to inputs that combine text, audio, images, and even video.
Hiding Instructions in Images and Audio
At Black Hat Europe 2023 this week, researchers from Cornell University will demonstrate an attack they developed that uses images and sounds to inject instructions into multimodal LLMs, causing the models to output attacker-specified text and instructions. Their proof-of-concept attack examples targeted the PandaGPT and LLaVA multimodal LLMs.
"The attacker’s goal is to steer the conversation between a user and a multi-modal chatbot," the researchers wrote in a paper titled "Abusing Images and Sounds for Indirect Instruction Injection in Multi-Modal LLM" that explains their attack. "To this end, the attacker blends a prompt into an image or audio clip and manipulates the user into asking the chatbot about it." The researchers plan on showing how, once the chatbot processes the input, it will output either an attacker-injected prompt hidden in the audio or image file or follow whatever instructions the attacker might have included in the prompt.
As an example, the researchers blended an instruction into an audio clip available online that caused PandaGPT to respond with an attacker-specified string. If a user had input the audio clip into the chatbot and asked for a description of the sound, the model's response would have directed the user to visit a malicious URL, ostensibly to find out more about the "very rare bird" that had produced the sound.
In another example, the researchers blended an instruction into an image of a building that would have caused LLaVA to chat like Harry Potter if a user had input the image into the chatbot and asked a question about it.
Ben Nassi, a researcher at Cornell University and one of the authors of the report, says one of the goals of their research was to find ways to inject prompts indirectly into a multimodal chatbot in a manner undetectable to the user. The other was to ensure they could "perturb" an image or audio clip without affecting the LLM's ability to correctly answer questions about the input.
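Conceptually, that blending step resembles a targeted adversarial-example optimization: a small perturbation is added to the image so that, when a user asks an ordinary question about it, the model's most likely continuation is the attacker's chosen text. The Python sketch below illustrates the general idea only; the model wrapper, tokenizer interface, and question string are hypothetical stand-ins, not the researchers' actual code or the PandaGPT/LLaVA APIs.

```python
import torch
import torch.nn.functional as F

def inject_prompt(model, tokenizer, image, target_text,
                  question="Describe this image.",
                  steps=500, lr=0.01, eps=8 / 255):
    """Optimize a small image perturbation so the model's answer to an
    ordinary question about the image begins with target_text.

    `model` and `tokenizer` are assumed interfaces: model(image, text,
    labels=...) is taken to return per-token logits for the target tokens
    under teacher forcing, which is a hypothetical wrapper, not a real API.
    """
    target_ids = tokenizer.encode(target_text, return_tensors="pt")  # (1, T)
    delta = torch.zeros_like(image, requires_grad=True)
    optimizer = torch.optim.Adam([delta], lr=lr)

    for _ in range(steps):
        perturbed = torch.clamp(image + delta, 0.0, 1.0)
        logits = model(perturbed, question, labels=target_ids)  # (1, T, vocab)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
                               target_ids.view(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Keep the perturbation small so the image still looks unremarkable
        # and the model can still answer normal questions about it.
        with torch.no_grad():
            delta.clamp_(-eps, eps)

    return torch.clamp(image + delta, 0.0, 1.0).detach()
```

The same idea carries over to audio by perturbing the waveform rather than pixel values, which is how the PandaGPT example above hides its instruction in a sound clip.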
Nassi describes the research as building on studies by others showing how LLMs are vulnerable to prompt injection attacks, where an adversary engineers inputs or prompts in such a manner as to intentionally influence the model's output. One recent example is a study by researchers at Google's DeepMind and six universities that showed how ChatGPT could be manipulated into regurgitating large amounts of its training data, including sensitive and personally identifiable information, simply by prompting it to repeat certain words such as "poem" and "company" forever.
The attack that Nassi and his team will demonstrate at Black Hat is different in that it involves an indirect prompt. In other words, the user is not the attacker, as is the case with regular prompt injection, but rather the victim.
"We don't use the user as an adversary," says Eugene Bagdasaryan, a researcher at Cornell and the lead author on the report. The other two authors are Cornell researchers Tsung-Yin Hsieh and Vitaly Shmatikov. "In this case, we demonstrate that the user has no idea that the image or the audio contains something bad," Bagdasaryan adds.
Indirect Prompt Injection Attacks
The new paper is not the first to explore the idea of indirect prompt injection as a way to attack LLMs. In May, researchers at Germany's CISPA Helmholtz Center for Information Security at Saarland University and Sequire Technology published a report that described how an attacker could exploit LLMs by injecting hidden prompts into data that the model would likely retrieve when responding to a user input. "The easily extensible nature of LLMs' functionalities via natural prompts can enable more straightforward attack tactics," the researchers concluded.
In that case, however, the attack involved strategically placed text prompts. Bagdasaryan says their attack is different because it shows how an attacker could inject malicious instructions into audio and image inputs as well, making them potentially harder to detect.
Another distinction of the attacks involving manipulated audio and image inputs is that the chatbot continues to respond in its instructed manner for the entirety of a conversation. For instance, prompting the chatbot to respond in Harry Potter-like fashion causes it to keep doing so even after the user has stopped asking about the specific image or audio sample.
Ways to direct a user to a weaponized image or audio clip include a phishing or social engineering lure pointing to a webpage with an interesting image, or an email carrying an audio clip. "When the victim directly inputs the image or the clip into an isolated LLM and asks questions about it, the model will be steered by attacker-injected prompts," the researchers wrote in their paper.
The research is significant because many organizations are rushing to integrate LLM capabilities into their applications and operations. Attackers that devise ways to sneak poisoned text, image, and audio prompts into these environments could cause significant damage.