Simple Hacking Technique Can Extract ChatGPT Training Data
Apparently all it takes to get a chatbot to start spilling its secrets is prompting it to repeat certain words like "poem" forever.
December 1, 2023
Can getting ChatGPT to repeat the same word over and over again cause it to regurgitate large amounts of its training data, including personally identifiable information and other data scraped from the Web?
The answer is an emphatic yes, according to a team of researchers at Google DeepMind, Cornell University, and four other universities who tested the hugely popular generative AI chatbot's susceptibility to leaking data when prompted in a specific way.
'Poem' as a Trigger Word
In a report this week, the researchers described how they got ChatGPT to spew out memorized portions of its training data merely by prompting it to repeat words like "poem," "company," "send," "make," and "part" forever.
For example, when the researchers prompted ChatGPT to repeat the word "poem" forever, the chatbot initially responded by repeating the word as instructed. But after a few hundred times, ChatGPT began generating "often nonsensical" output, a small fraction of which included memorized training data such as an individual's email signature and personal contact information.
The researchers discovered that some words were better at getting the generative AI model to spill memorized data than others. For instance, prompting the chatbot to repeat the word "company" caused it to emit training data 164 times more often than other words, such as "know."
Data that the researchers were able to extract from ChatGPT in this manner included personally identifiable information on dozens of individuals; explicit content (when the researchers used an NSFW word as a prompt); verbatim paragraphs from books and poems (when the prompts contained the word "book" or "poem"); and URLs, unique user identifiers, bitcoin addresses, and programming code.
A Potentially Big Privacy Issue?
"Using only $200 USD worth of queries to ChatGPT (gpt-3.5-turbo), we are able to extract over 10,000 unique verbatim memorized training examples," the researchers wrote in their paper titled "Scalable Extraction of Training Data from (Production) Language Models."
"Our extrapolation to larger budgets suggests that dedicated adversaries could extract far more data," they wrote. The researchers estimated an adversary could extract 10 times more data with more queries.
Dark Reading's attempts to use some of the prompts in the study did not generate the output the researchers mentioned in their report. It's unclear if that's because ChatGPT creator OpenAI has addressed the underlying issues after the researchers disclosed their findings to the company in late August. OpenAI did not immediately respond to a Dark Reading request for comment.
The new research is the latest attempt to understand the privacy implications of developers using massive datasets scraped from different — and often not fully disclosed — sources to train their AI models.
Previous research has shown that large language models (LLMs) such as ChatGPT often can inadvertently memorize verbatim patterns and phrases in their training datasets. The tendency for such memorization increases with the size of the training data.
Researchers have shown how such memorized data is often discoverable in a model's output. Other researchers have shown how adversaries can use so-called divergence attacks to extract training data from an LLM. A divergence attack is one in which an adversary uses intentionally crafted prompts or inputs to get an LLM to generate outputs that diverge significantly from what it would typically produce.
In many of these studies, researchers have used open source models — where the training datasets and algorithms are known — to test the susceptibility of LLM to data memorization and leaks. The studies have also typically involved base AI models that have not been aligned to operate in a manner like an AI chatbot such as ChatGPT.
A Divergence Attack on ChatGPT
The latest study is an attempt to show how a divergence attack can work on a sophisticated closed, generative AI chatbot whose training data and algorithms remain mostly unknown. The study involved the researchers developing a way to get ChatGPT "to 'escape' out of its alignment training" and getting it to "behave like a base language model, outputting text in a typical Internet-text style." The prompting strategy they discovered (of getting ChatGPT to repeat the same word incessantly) caused precisely such an outcome, resulting in the model spewing out memorized data.
To verify that the data the model was generating was indeed training data, the researchers first built an auxiliary dataset containing some 9 terabytes of data from four of the largest LLM pre-training datasets — The Pile, RefinedWeb, RedPajama, and Dolma. They then compared the output data from ChatGPT against the auxiliary dataset and found numerous matches.
The researchers figured they were likely underestimating the extent of data memorization in ChatGPT because they were comparing the outputs of their prompting only against the 9-terabyte auxiliary dataset. So they took some 494 of ChatGPT's outputs from their prompts and manually searched for verbatim matches on Google. The exercise yielded 150 exact matches, compared to just 70 against the auxiliary dataset.
"We detect nearly twice as many model outputs are memorized in our manual search analysis than were detected in our (comparatively small)" auxiliary dataset, the researchers noted. "Our paper suggests that training data can easily be extracted from the best language models of the past few years through simple techniques."
The attack that the researchers described in their report is specific to ChatGPT and does not work against other LLMs. But the paper should help "warn practitioners that they should not train and deploy LLMs for any privacy-sensitive applications without extreme safeguards," they noted.
About the Author
You May Also Like