Expert Insights: How to Protect Sensitive Machine-Learning Training Data Without Borking It

Another element of ML security is the data used to train the machine learning system itself.

Gary McGraw Ph.D., Co-Founder, Berryville Institute of Machine Learning

October 4, 2022


Previous columns in this series introduced the problem of data protection in machine learning (ML), emphasizing the real challenge that operational query data pose. That is, when you use an ML system, you most likely face more data-exposure risk than when you train one up in the first place.

In my rough estimation, data account for at least 60% of the known machine-learning security risks identified by the Berryville Institute of Machine Learning (BIML). That chunk of risk (the 60%) further divides roughly nine to one between operational data exposure and training data exposure. Training data thus accounts for a minority of ML's data risk, but an important minority. The upshot is that we need to spend some real energy mitigating the operational data-risk problem we previously discussed, and we also need to consider training data exposure.

Interestingly, everybody in the field seems to talk only about protecting training data. So why all the fuss there? Don't forget the ultimate fact about ML: the algorithm that does all of the learning is really just an instantiation of the data in machine-runnable form!

So if your training set includes sensitive data, then by definition the machine you construct out of those data elements (using ML) includes sensitive information. If your training set includes biased or regulated data, then by definition the machine you construct out of those data elements includes biased or regulated information. And if your training set includes enterprise-confidential data, then by definition the machine you construct out of those data elements includes enterprise-confidential information. And so on.

The algorithm is the data and becomes the data through training.

So it seems the big focus the ML field puts on protecting training data has some merit. Not surprisingly, one of the main ideas for approaching the training data problem is to fix the training data so that it no longer directly includes sensitive, biased, regulated, or confidential data. At one extreme, you can simply delete those data elements from your training set. Slightly less radical, but no less problematic, is the idea of adjusting the training data to mask or obscure the sensitive, biased, regulated, or confidential elements.

Let’s spend some time looking at that.

Data Owner vs. Data Scientist

One of the hardest things to get straight in this new machine-learning paradigm is just who is taking on what risk. That makes the idea of where to place and enforce trust boundaries a bit tricky. As an example, we need to separate and understand not just operational data and training data as described above, but further determine who has (and who should have) access to training data at all.

Even worse, the question of whether any of the training data elements are biased, subject to protected-class membership, protected under the law, regulated, or otherwise confidential is thornier still.

First things first: somebody generated the possibly worrisome data in the first place, and they own those data components. The data owner may thus end up with a bunch of data they are charged with protecting, such as race information, Social Security numbers, or pictures of people's faces. That's the data owner.

More often than not, the data owner is not the same entity as the data scientist, who is supposed to use data to train a machine to do something interesting. That means that security people need to recognize a significant trust boundary between the data owner and the data scientist who trains up the ML system.

In many cases, the data scientist needs to be kept at arm’s length from the "radioactive" training data that the data owner controls. So how would that work?

Differential Privacy

Let's start with the worst approach to protecting sensitive training data: doing nothing at all. Or possibly even worse, intentionally doing nothing while pretending to do something. To illustrate this issue, we'll use the face-recognition data that Facebook (now Meta) hoovered up over the years. Facebook built a facial recognition system using lots of pictures of its users' faces. Lots of people think this is a massive privacy issue. (There are also very real concerns about how racially biased facial-recognition systems are, but that is for another article.)

After facing privacy pressure over its facial recognition system, Facebook built a data-transformation system, called Face2Vec, that transforms raw face data (pictures) into vectors, with each face getting a unique Face2Vec representation. Facebook then said that it deleted all of the faces, even as it kept the huge Face2Vec dataset. Note that, mathematically speaking, Facebook did nothing to protect user privacy; rather, it kept a unique representation of the data.
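To see why keeping a unique vector per face is still personally identifying, consider that a new photo of the same person embeds near their stored vector, so a simple nearest-neighbor lookup re-identifies them. The sketch below is purely illustrative; the vectors stand in for the output of any face-embedding model, and nothing here is Facebook's actual code.

```python
import numpy as np

# Hypothetical: 'stored' holds one embedding vector per known identity;
# 'query' is the embedding of a newly uploaded photo.
def reidentify(query: np.ndarray, stored: np.ndarray) -> int:
    """Return the index of the stored identity whose vector is most
    similar to the query embedding (cosine similarity)."""
    sims = stored @ query / (
        np.linalg.norm(stored, axis=1) * np.linalg.norm(query) + 1e-12
    )
    return int(np.argmax(sims))
```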

One of the most common approaches to actually doing something about privacy is differential privacy. Simply put, differential privacy aims to protect particular data points by statistically “mungifying” the data so that individually sensitive points are no longer in the data set, but the ML system still works. The trick is to maintain the power of the resulting ML system even though the training data have been borked through an aggregation and “fuzzification” process. If the data components are over-processed this way, the ML system can't do its job.
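As a concrete illustration, here is a minimal sketch of one classic differentially private mechanism, randomized response, applied to a single sensitive binary attribute. Real deployments use far more sophisticated mechanisms (and often perturb gradients or query results rather than raw records), but the trade-off is the same: each individual record gains plausible deniability while aggregate statistics remain usable.

```python
import numpy as np

def randomized_response(bits: np.ndarray, epsilon: float) -> np.ndarray:
    """Keep each true bit with probability e^eps / (e^eps + 1), flip it
    otherwise. Any single released bit could plausibly be noise."""
    p_keep = np.exp(epsilon) / (np.exp(epsilon) + 1.0)
    keep = np.random.rand(len(bits)) < p_keep
    return np.where(keep, bits, 1 - bits)

# Toy example: a sensitive 0/1 attribute for 10,000 records.
sensitive = np.random.binomial(1, 0.3, size=10_000)
released = randomized_response(sensitive, epsilon=1.0)

# Aggregate statistics survive: de-bias the noisy release to estimate the true rate.
p = np.exp(1.0) / (np.exp(1.0) + 1.0)
estimated_rate = (released.mean() - (1 - p)) / (2 * p - 1)
print(round(estimated_rate, 3))  # close to 0.3
```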

But if an ML system user can determine whether data from a particular individual was in the original training data (called membership inference), the data was not borked enough. Note that differential privacy works by editing the sensitive data set itself before training.
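To make membership inference concrete, here is a toy version of the simplest such attack: models tend to be more confident (lower loss) on records they were trained on, so an attacker compares the per-record loss against a threshold calibrated on known non-members. The model here is assumed to expose a scikit-learn-style predict_proba; real attacks are more elaborate (for example, using shadow models).

```python
import numpy as np

def likely_member(model, x: np.ndarray, y: int, threshold: float) -> bool:
    """Guess 'member' if the model's cross-entropy loss on (x, y) is
    suspiciously low. 'threshold' would be calibrated by the attacker
    on records known not to be in the training set."""
    proba = model.predict_proba(x.reshape(1, -1))[0]
    loss = -np.log(proba[y] + 1e-12)
    return loss < threshold
```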

One system being investigated -- and commercialized -- involves adjusting the training process itself to mask sensitivities in a training dataset. The gist of the approach is to use the same kind of mathematical transformation at training time and at inference time to protect against sensitive data exposure (including membership inference).

Based on the mathematical idea of mutual information, this approach involves adding Gaussian noise only to unconducive features (those that contribute little to inference) so that the dataset is obfuscated but its inference power remains intact. The core idea is to build an internal representation that is cloaked at the sensitive-feature layer.
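Here is a rough sketch of that idea, using scikit-learn's mutual-information estimator as a stand-in for the real scoring step: estimate how much each feature tells you about the label, then add Gaussian noise only to the features that contribute little to inference. This is a toy illustration of the general technique, not the commercial system's actual algorithm.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def obfuscate_low_information_features(
    X: np.ndarray, y: np.ndarray, noise_scale: float = 1.0, mi_quantile: float = 0.5
) -> np.ndarray:
    """Add Gaussian noise to the features with the least mutual information
    with the label, leaving the predictive features intact."""
    mi = mutual_info_classif(X, y)                 # per-feature relevance to the task
    low_info = mi <= np.quantile(mi, mi_quantile)  # the "unconducive" features
    noisy = X.astype(float).copy()
    noisy[:, low_info] += np.random.normal(
        0.0, noise_scale, size=(X.shape[0], int(low_info.sum()))
    )
    return noisy
```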

One cool thing about targeted feature obfuscation is that it can help protect a data owner from data scientists by preserving the trust boundary that often exists between them.

Build Security In

Does all this mean that the problem of sensitive training data is solved? Not at all. The challenge of any new field remains: the people constructing and using ML systems need to build security in. In this case, that means recognizing and mitigating training data sensitivity risks when they are building their systems.

The time to do this is now. If we construct a slew of ML systems with enormous data exposure risks built right in, well, we’ll get what we asked for: another security disaster.

About the Author

Gary McGraw Ph.D.

Co-Founder, Berryville Institute of Machine Learning

Gary McGraw is co-founder of the Berryville Institute of Machine Learning, where his work focuses on machine learning security. He is a globally recognized authority on software security and the author of eight best-selling books on this topic. His titles include Software Security, Exploiting Software, Building Secure Software, Java Security, Exploiting Online Games, and 6 other books; and he is editor of the Addison-Wesley Software Security series. Dr. McGraw has also written over 100 peer-reviewed scientific publications. Gary serves on the Advisory Boards of Calypso AI, Legit, Irius Risk, Maxmyinterest, Protopia AI, and Red Sift. He has also served as a Board member of Cigital and Codiscope (acquired by Synopsys) and as Advisor to CodeDX (acquired by Synopsys), Black Duck (acquired by Synopsys), Dasient (acquired by Twitter), Fortify Software (acquired by HP), and Invotas (acquired by FireEye). Gary produced the monthly Silver Bullet Security Podcast for IEEE Security & Privacy magazine for thirteen years. His dual PhD is in Cognitive Science and Computer Science from Indiana University, where he serves on the Dean's Advisory Council for the Luddy School of Informatics, Computing, and Engineering.
