Vendors Training AI With Customer Data Is an Enterprise Risk
While Zoom has scrapped plans to harvest customer content for use in its AI and ML models, the incident should raise concerns for enterprises and consumers alike.
August 25, 2023
Zoom recently received some flak for planning to use customer data to train its machine learning (ML) models. The reality, however, is that the videoconferencing company is not the first, nor will it be the last, to have similar plans.
Enterprises — especially those busy integrating artificial intelligence (AI) tools for internal use — should view such plans as an emerging risk that needs to be proactively addressed with new processes, oversight, and technology controls where possible.
Abandoned AI Plans
Zoom changed its terms of service earlier this year to give itself the right to use at least some customer content to train its AI and ML models. In early August, the company abandoned that change after pushback from customers who were concerned about their audio, video, chat, and other communications being used this way.
The incident — despite the happy ending for now — is a reminder that companies need to pay closer attention to how technology vendors and other third parties might use their data in the rapidly emerging AI era.
One big mistake is to assume that the data a technology company might collect for AI training is not very different from data the company might collect about service use, says Claude Mandy, chief evangelist, data security at Symmetry Systems.
"Technology companies have been using data about their customers' use of services for a long time," Mandy says. "However, this has generally been limited to metadata about the usage, rather than the content or data being generated by or stored in the services."
In essence, while both involve customer data, there's a big difference between data about the customer and data of the customer, he says.
Clear Distinction
It's a distinction that is already the focus of attention in a handful of lawsuits involving major technology companies and consumers. One of them pits Google against a class of millions of consumers. The lawsuit, filed in July in San Francisco, accuses Google of scraping publicly available data on the Internet — including personal and professional information, creative and copyrighted works, photos, and even emails — and using it to train its Bard generative AI technology.
"In the words of the FTC, the entire tech industry is 'sprinting to do the same' — that is, to vacuum up as much data as they can find," the lawsuit alleges.
Another class-action lawsuit accuses Microsoft of doing precisely the same thing to train ChatGPT and other AI tools, such as DALL-E and VALL-E. In July, comedian Sarah Silverman and two authors accused Meta and OpenAI of using their copyrighted material without consent for AI training purposes.
While the lawsuits involve consumers, the takeaway for organizations is that they need to ensure, where possible, that technology companies don't do the same with their data.
"There is no equivalence between using customer data to improve the user experience and [for] training AI. This is apples and oranges," cautions Denis Mandich co-founder of Qrypt and former member of the US intelligence community. "AI has the additional risk of being individually predictive, putting people and companies in jeopardy."
As an example, he points to a startup using video and file transfer services on a third-party communications platform. A generative AI tool, like ChatGPT, trained on this data could potentially be a good source of information for a competitor to that startup, Mandich says.
"The issue here is about the content, not the user experience for video/audio quality, GUI, etc.," he says.
Oversight and Due Diligence
The big question, of course, is what exactly organizations can do to mitigate the risk of their sensitive data ending up as part of AI models.
A starting point would be to opt out of all AI training and generative AI features that are not under private deployment, says Omri Weinberg, co-founder and chief risk officer at DoControl.
"This precautionary step is important to prevent the external exposure of data [when] we do not have a comprehensive understanding of its intended use and potential risks," he says.
In addition, make sure there are no ambiguities in a technology vendor's terms of service pertaining to company data and how it is used, says Heather Shoemaker, CEO and founder of Language I/O.
"Ethical data usage hinges on policy transparency and informed consent," she notes.
Further, AI tools can store customer information beyond just the training usage, meaning data could potentially be vulnerable in the case of a cyberattack or data breach.
Qrypt's Mandich advocates that companies insist on technology providers using end-to-end encryption wherever possible.
"There is no reason to risk access by third parties unless they need it for data mining and your company has knowingly agreed to allow it," he says. "This should be explicitly detailed in the EULA and demanded by the client."
The ideal is to have all encryption keys issued and managed by the company and not the provider, Mandich adds.
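The arrangement Mandich describes — the customer, not the provider, generating and holding the keys — can be illustrated with a short sketch. In this model the vendor only ever receives ciphertext, so customer content cannot be read, mined, or fed into AI training pipelines. This is a minimal illustration using the third-party Python `cryptography` library; the names (`vendor_storage`, the sample transcript) are hypothetical and do not correspond to any specific vendor's API.

```python
from cryptography.fernet import Fernet

# The customer generates and retains the key; the vendor never sees it.
customer_key = Fernet.generate_key()
cipher = Fernet(customer_key)

# Content is encrypted client-side before it ever reaches the vendor.
meeting_transcript = b"Q3 roadmap discussion -- confidential"
ciphertext = cipher.encrypt(meeting_transcript)

# The vendor stores only ciphertext, which is useless as AI training data.
vendor_storage = {"meeting-001": ciphertext}

# Only the key holder can recover the plaintext.
recovered = cipher.decrypt(vendor_storage["meeting-001"])
```

Real deployments would layer key rotation, access controls, and hardware-backed key storage on top of this, but the core principle is the same: whoever holds the keys controls what the data can be used for.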