Expert Insights: Training the Data Elephant in the AI Room

Be aware of the risk of inadvertent data exposure in machine learning systems.

Gary McGraw Ph.D., Co-Founder, Berryville Institute of Machine Learning

February 4, 2022


One of the trickiest aspects of actually using machine learning (ML) in practice is devoting the right amount of attention to the data problem. This is something I discussed in two previous Dark Reading columns about machine learning security, Building Security into Software and How to Secure Machine Learning.

You see, the “machine” in ML is really constructed directly from a bunch of data.

My early estimates of the security risk involved in machine learning make the strong claim that data-related risks are responsible for 60% of the overall risk, with the rest (say, algorithm or online operations risks) accounting for the remaining 40%. I found that both surprising and concerning when I started working on ML security in 2019, mostly because not enough attention is being paid to data-related risks. But you know what? Even that estimate got things wrong.

When you consider the full ML lifecycle, data-related risks gain even more prominence. That’s because, in terms of sheer data exposure, putting ML into practice may often expose even more data than training or fielding the ML model in the first place. Way more. Here’s why.

Data Involved in Training

Recall that when you “train up” an ML algorithm - say using supervised learning for a simple categorization or prediction task - you must think carefully about the datasets you’re using. In many cases, the data used to build the ML in the first place come from a data warehouse storing data that are both business confidential and carry a strong privacy burden.

An example may help. Consider a banking application of ML that helps a loan officer decide whether or not to proceed with a loan. The ML problem at hand is predicting whether the applicant will pay the loan back. Using data scraped from past loans made by the institution, an ML system can be trained up to make this prediction.

Obviously in this example, the data from the data warehouse used to train the algorithm include both strictly private information, some of which may be protected (like, say, salary and employment information, race, and gender), and business confidential information (like, say, whether a loan was offered and at what rate of return).

The tricky data security aspect of ML involves using these data in a safe, secure, and legal manner. Gathering and building the training, testing, and evaluation sets is non-trivial and bears some risk. Fielding the trained ML model itself also bears some risk as the data are in some sense “built right in” to the ML model (and thus subject to leaking back out, sometimes unintentionally).
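To make the “built right in” point concrete, here is a minimal Python sketch of an extreme case: a nearest-neighbor model, which keeps its training rows verbatim. The class name, the two-field representation (salary, years employed), and the loan records are all made up for illustration. Most models do not store their data this literally, but the leakage concern is the same in kind.

```python
class NearestNeighborModel:
    """Toy 1-nearest-neighbor classifier (hypothetical, for illustration only)."""

    def __init__(self, rows, labels):
        # "Training" here is just storing the data: the model IS the data.
        self.rows = [tuple(float(x) for x in row) for row in rows]
        self.labels = list(labels)

    def predict(self, query):
        # Return the label of the stored row closest to the query.
        def squared_distance(row):
            return sum((a - b) ** 2 for a, b in zip(row, query))

        nearest = min(range(len(self.rows)),
                      key=lambda i: squared_distance(self.rows[i]))
        return self.labels[nearest]


# Made-up past loans: (salary, years_employed) -> was the loan repaid?
past_loans = [(60000, 4), (20000, 1)]
repaid = [True, False]

model = NearestNeighborModel(past_loans, repaid)
prediction = model.predict((58000, 3))

# Anyone holding the fielded model can read the private training rows back out.
leaked_rows = model.rows
```

Shipping this model to the cloud ships the loan records along with it, which is the risk in miniature.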

For the sake of filling in our example, let's say that the ML system we’re postulating is trained up inside the data warehouse, but that it is operated in the cloud and can be used by hundreds of regional and local branches of the institution.

Clearly data exposure is a thing to think carefully about when it comes to ML.

Data Involved in Operations

But wait, there’s more. When an ML system like the one we’re discussing is fielded, it works as follows. New situations are gathered and built into “queries” using the same kind of representation used to build the ML model in the first place. Those queries are then presented to the model which uses them as inputs to return a prediction or categorization relevant to the task at hand. (This is what ML people mean when they say auto-associative prediction.)

Back to our loan example, when a loan application comes in through a loan officer in a branch office, some of that information will be used to build and run a query through the ML model as part of the loan decision-making process. In our example, this query is likely to include both business confidential and protected private information subject to regulatory control.
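A rough sketch of that query-building step, in Python, may help. The field names, the weights, and the linear scoring rule standing in for the trained model are all assumptions made for illustration; a real system would use whatever representation and model the institution actually trained.

```python
# Hypothetical field order shared by training time and query time.
FEATURES = ("salary", "years_employed", "loan_amount")


def encode(record):
    """Turn a raw application into the fixed-order vector the model expects."""
    return [float(record[name]) for name in FEATURES]


def predict_repay(vector, weights=(0.00002, 0.3, -0.00001), bias=-1.0):
    """Stand-in for the trained model: a made-up linear score and threshold."""
    score = bias + sum(w * x for w, x in zip(weights, vector))
    return score > 0.0


# A made-up application as it might arrive from a branch office.
application = {
    "applicant_name": "A. Example",
    "salary": 60000,
    "years_employed": 4,
    "loan_amount": 20000,
}

query = encode(application)      # the query now carries the applicant's salary
decision = predict_repay(query)  # True means "predicted to repay"
```

Note that the query itself contains private data, so every hop between the branch office and the model is a place where that data is exposed.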

The institution will very likely put the ML system to good use over hundreds of thousands (or maybe even millions) of customers seeking loans. Now think about the data exposure risk brought to bear by the compounded queries themselves. That is a very large pile of data. Some analysts estimate that 95% of ML data exposure comes through operational exposure of this sort. Regardless of the actual breakdown, it is very clear that operational data exposure is something to think carefully about.
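The arithmetic behind that intuition is simple enough to sketch in Python. The record counts below are hypothetical, and the sketch assumes every training record and every query counts as one equal unit of exposure; it is not how any analyst arrived at the 95% figure, only a way to see how quickly query volume can dominate.

```python
def operational_share(training_records: int, queries: int) -> float:
    """Fraction of all exposed records that arrive as operational queries."""
    total = training_records + queries
    if total == 0:
        raise ValueError("no records at all")
    return queries / total


# Hypothetical numbers: 50,000 past loans for training, one million queries.
share = operational_share(50_000, 1_000_000)
print(f"{share:.1%}")  # prints 95.2%
```

The training set is gathered once, while the query pile keeps growing for as long as the system is in use.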

Limiting Data Exposure

How can this operational data exposure risk built into the use of ML be properly mitigated?

There are a number of ways to do this. One might be encrypting the queries on their way to the ML system, then decrypting them only when they are run through the ML. Depending on where the ML system is being run and who is running it, that may work. As one example, Google’s BigQuery system supports customer-managed keys to do this kind of thing.

Another, more clever solution may be to stochastically transform the representation of the query fields, thereby minimizing the exposure of the original information to the ML's decision process without affecting its accuracy. This involves some insight into how the ML makes its decisions, but in many cases can be used to shrink-wrap queries down significantly (blinding fields that are not relevant). Protopia AI is pursuing this technical approach together with other solutions that address ML data risk during training. (Full disclosure, I am a Technical Advisor for Protopia AI.)
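Here is a minimal Python sketch of the general idea, under loud assumptions: it simply drops fields assumed to be irrelevant and adds Gaussian noise to the ones it keeps. This is not Protopia AI’s method, the field names and noise levels are invented, and nothing in it guarantees that accuracy is preserved; choosing which fields to blind and how much noise to add is exactly where insight into the model’s decision process comes in.

```python
import random


def blind_query(record, noise_scale, rng):
    """Keep only the fields named in noise_scale and add Gaussian noise to each.

    noise_scale maps each retained field to the standard deviation of the
    noise added to it; every other field in the record is dropped.
    """
    blinded = {}
    for name, sigma in noise_scale.items():
        blinded[name] = float(record[name]) + rng.gauss(0.0, sigma)
    return blinded


# A made-up application, including fields the model is assumed not to need.
application = {
    "applicant_name": "A. Example",
    "gender": "F",
    "salary": 60000,
    "loan_amount": 20000,
}

# Hypothetical choice of relevant fields and per-field noise levels.
noise = {"salary": 2000.0, "loan_amount": 500.0}

rng = random.Random(0)  # seeded only so the sketch is repeatable
blinded = blind_query(application, noise, rng)
```

The blinded query no longer contains the applicant’s name or gender at all, and the values that remain are only approximations of the originals.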

Regardless of the particular solution, and much to my surprise, operational data exposure risk in ML goes far beyond the risk of fielding a model with the training data “built in.” Operational data exposure risk is a thing - and something to watch closely - as ML security matures.

About the Author

Gary McGraw is co-founder of the Berryville Institute of Machine Learning where his work focuses on machine learning security. He is a globally recognized authority on software security and the author of eight best-selling books on this topic. His titles include Software Security, Exploiting Software, Building Secure Software, Java Security, Exploiting Online Games, and 6 other books; and he is editor of the Addison-Wesley Software Security series. Dr. McGraw has also written over 100 peer-reviewed scientific publications. Gary serves on the Advisory Boards of Calypso AI, Legit, Irius Risk, Maxmyinterest, Protopia AI, and Red Sift. He has also served as a Board member of Cigital and Codiscope (acquired by Synopsys) and as Advisor to CodeDX (acquired by Synopsys), Black Duck (acquired by Synopsys), Dasient (acquired by Twitter), Fortify Software (acquired by HP), and Invotas (acquired by FireEye). Gary produced the monthly Silver Bullet Security Podcast for IEEE Security & Privacy magazine for thirteen years. His dual PhD is in Cognitive Science and Computer Science from Indiana University where he serves on the Dean’s Advisory Council for the Luddy School of Informatics, Computing, and Engineering.
