I recently published a longer piece on security vulnerabilities and potential defenses for machine learning models. Here’s a synopsis.
Today it seems like there are about five major varieties of attacks against machine learning (ML) models, along with some general concerns and solutions to be aware of. I'll address them one by one below.
Data poisoning happens when a malicious insider or outsider changes your model's training data so that the predictions from your final trained model either benefit them or harm others.
A malicious actor could get a job at a small, disorganized lender where the same person is allowed to manipulate training data, build models, and deploy models. Or the bad actor could work at a massive financial services firm and slowly request or accumulate that same set of permissions. This person could then change a lending model's training data to award disproportionately large loans to people they like, or to grant unreasonably small loans to people (or groups of people) they don't like.
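One basic defense is to treat approved training data as an artifact whose integrity you can verify before every model build. Below is a minimal sketch, assuming you record a trusted SHA-256 digest of the training file at the last human review; the file name and digest value here are placeholders.

```python
# Minimal sketch of a training data tamper check. Assumes a trusted SHA-256
# digest is recorded each time the data is reviewed and approved;
# "train.csv" and TRUSTED_DIGEST are placeholders.
import hashlib

def file_sha256(path: str) -> str:
    """Return the SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

TRUSTED_DIGEST = "0000...placeholder..."  # recorded at the last data review

if file_sha256("train.csv") != TRUSTED_DIGEST:
    # In practice you might halt the training pipeline and notify a reviewer.
    raise RuntimeError("Training data changed since it was last reviewed.")
```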
Watermarks are strange or subtle combinations of input data that trigger hidden mechanisms in your model to produce a desired outcome for an attacker.
A malicious insider or outside attacker could hack the production code that generates your model's predictions so that it responds to some unknown combination of input data in a way that benefits them or their associates, or in a way that hurts others. For instance, an input combination such as years_on_job > age could trigger a hidden branch of code that awards improperly small insurance premiums to the attacker or their associates.
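A simple mitigation is to screen incoming rows for impossible value combinations before they ever reach the model. Here's a minimal sketch with pandas, assuming illustrative column names like age and years_on_job; real constraints would come from your own data dictionary.

```python
# Minimal sketch of a scoring-time integrity check. Column names 'age' and
# 'years_on_job' are illustrative. Rows with impossible combinations are
# held for review instead of being scored, which blunts triggers like
# years_on_job > age.
import pandas as pd

def split_suspicious_rows(scoring_df: pd.DataFrame):
    """Separate rows that violate simple domain constraints from the rest."""
    impossible = scoring_df["years_on_job"] > scoring_df["age"]
    return scoring_df[~impossible], scoring_df[impossible]

new_rows = pd.DataFrame({"age": [35, 22], "years_on_job": [10, 40]})
clean, flagged = split_suspicious_rows(new_rows)
print(f"{len(flagged)} row(s) held for manual review.")  # prints 1
```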
Inversion generally refers to an attacker getting improper information out of your model rather than putting information into it. A surrogate model is a model of another model. In this type of attack, a hacker models your model's predictions to build an approximate copy of your model. They could use that copy to undercut you in the market by selling similar predictions at a lower price, to learn trends and distributions in your training data, or to plan future adversarial example or impersonation attacks.
Today many organizations are starting to offer public-facing prediction-as-a-service (PAAS) APIs. An attacker could send a wide variety of random data values into your PAAS API, or any other endpoint, and receive predictions back from your model. They could then fit their own ML model between those input values and your predictions, effectively building a copy of your model!
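To make the risk concrete, here's a minimal sketch of that extraction loop. The PAAS endpoint is faked by a local scikit-learn model so the example runs end to end; in a real attack, query_paas_api would be an HTTP call to your service.

```python
# Minimal sketch of model extraction via a prediction endpoint.
# `query_paas_api` is a stand-in for a real public API call.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Pretend this is your proprietary model, hidden behind an API.
X_secret = rng.normal(size=(1000, 5))
y_secret = (X_secret[:, 0] + X_secret[:, 1] > 0).astype(int)
your_model = LogisticRegression().fit(X_secret, y_secret)

def query_paas_api(inputs: np.ndarray) -> np.ndarray:
    """Stand-in for the attacker's only access: predictions, not internals."""
    return your_model.predict(inputs)

# The attacker sends random inputs, records your predictions, and fits a
# surrogate model between the two.
X_probe = rng.normal(size=(5000, 5))
y_probe = query_paas_api(X_probe)
surrogate = DecisionTreeClassifier(max_depth=5).fit(X_probe, y_probe)

# How closely does the copy track the original on fresh data?
X_test = rng.normal(size=(1000, 5))
agreement = (surrogate.predict(X_test) == query_paas_api(X_test)).mean()
print(f"Surrogate agrees with the original on {agreement:.0%} of new rows.")
```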
Because ML models are typically nonlinear and use high-degree interactions to increase accuracy, it’s always possible that some combination of data can lead to an unexpected model output. Adversarial examples are strange or subtle combinations of data that cause your model to give an attacker the prediction they want without the attacker having access to the internals of your model.
If an attacker can request many predictions from your model, through a PAAS API or any other endpoint, they can use trial and error, or a surrogate model of your model, to learn how to trick it into producing the results they want. What if an attacker learned that clicking on a certain combination of products on your website would lead to a large promotion being offered to them? They could not only benefit from this themselves, but also tell others about the attack, potentially leading to large financial losses.
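Here's a minimal sketch of that trial-and-error search, again with the prediction endpoint faked by a local scikit-learn model. A real attacker would likely use a smarter, surrogate-guided search, but even crude random probing can work when queries are cheap.

```python
# Minimal sketch of a query-only adversarial search. `get_prediction` is a
# stand-in for a real endpoint; the attacker never sees model internals.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
y = (X.sum(axis=1) > 0).astype(int)
model = LogisticRegression().fit(X, y)

def get_prediction(row: np.ndarray) -> int:
    """Stand-in for an endpoint that returns a decision for one input row."""
    return int(model.predict(row.reshape(1, -1))[0])

# Start from an input that currently gets the unwanted outcome (class 0)
# and nudge it randomly until the endpoint returns the desired outcome.
candidate = np.array([-0.5, -0.5, -0.5, -0.5])
target = 1
for attempt in range(2000):
    if get_prediction(candidate) == target:
        print(f"Endpoint returned the desired outcome after {attempt} queries.")
        break
    candidate = candidate + rng.normal(scale=0.25, size=4)
else:
    print("No adversarial input found; a surrogate-guided search would do better.")
```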
Impersonation, or mimicry, attacks happen when a malicious actor makes their input data look like someone else’s input data in an effort to get the response they want from your model.
Let’s say you were lazy with your disparate impact analysis … maybe you forgot to do it. An attacker might not be so lazy. If they can map your predictions back to any identifiable characteristic (age, ethnicity, gender, or even something invisible like income or marital status), they can detect your model’s biases just from its predictions. (Sound implausible? Journalists from ProPublica were able to do just this in 2016.) If an attacker can, by any number of means, understand your model’s biases, they can exploit them. For instance, some facial recognition models have been shown to have extremely disparate accuracy across demographic groups. In addition to the serious fairness problems presented by such systems, these disparities are security vulnerabilities that malicious actors could easily exploit.
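Doing the disparate impact analysis yourself, before an adversary does, can be as simple as comparing favorable-outcome rates across groups. Here's a minimal sketch on a toy pandas DataFrame with illustrative column names; the roughly 0.8 threshold comes from the common four-fifths rule of thumb.

```python
# Minimal sketch of a basic disparate impact check. Assumes a DataFrame with
# a demographic column and your model's binary decisions (names are
# illustrative). The adverse impact ratio compares each group's
# favorable-outcome rate to the most favored group's rate.
import pandas as pd

scored = pd.DataFrame({
    "group":    ["a", "a", "a", "b", "b", "b", "b", "b"],
    "approved": [1,    1,   0,   1,   0,   0,   0,   1],
})

rates = scored.groupby("group")["approved"].mean()
air = rates / rates.max()
print(air.round(2))  # groups with ratios below ~0.8 deserve a closer look
```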
Some concepts aren’t associated with any one kind of attack, but could still be worrisome for many reasons. These might include:
There are a number of best practices that can be used to defend your models in general and that are probably beneficial for other model life-cycle management purposes as well. Some of these practices are:
Many practitioners I’ve talked to agree these attacks are possible and will probably happen … it’s a question of when, not if. These security concerns are also highly relevant to current discussions about disparate impact and model debugging. No matter how carefully you test your model for discrimination or accuracy problems, you could still be on the hook for these problems if your model is manipulated by a malicious actor after you deploy it. What do you think? Do these attacks seem plausible to you? Do you know about other kinds of attacks? Let us know here.