# In Neural Networks, Unbreakable Locks Can Hide Invisible Doors

## Introduction

Machine learning is having a moment. Yet even while image generators like DALL·E 2 and language models like ChatGPT grab headlines, experts still don’t understand why they work so well. That makes it hard to understand how they might be manipulated.

Consider, for instance, the software vulnerability known as a backdoor — an unobtrusive bit of code that can enable users with a secret key to obtain information or abilities they shouldn’t have access to. A company charged with developing a machine learning system for a client could insert a backdoor and then sell the secret activation key to the highest bidder.

To better understand such vulnerabilities, researchers have developed various tricks to hide their own sample backdoors in machine learning models. But the approach has been largely trial and error, lacking formal mathematical analysis of how well those backdoors are hidden.

Researchers are now starting to analyze the security of machine learning models in a more rigorous way. In a paper presented at last year’s Foundations of Computer Science conference, a team of computer scientists demonstrated how to plant undetectable backdoors whose invisibility is as certain as the security of state-of-the-art encryption methods.

The mathematical rigor of the new work comes with trade-offs, like a focus on relatively simple models. But the results establish a new theoretical link between cryptographic security and machine learning vulnerabilities, suggesting new directions for future research at the intersection of the two fields.

“It was a very thought-provoking paper,” said Ankur Moitra, a machine learning researcher at the Massachusetts Institute of Technology. “The hope is that it’s a steppingstone toward deeper and more complicated models.”

**Beyond Heuristics**

Today’s leading machine learning models derive their power from deep neural networks — webs of artificial neurons arranged in multiple layers, with every neuron in each layer influencing those in the next layer. The authors of the new paper looked at placing backdoors in a type of network called a machine learning classifier, which assigns the inputs that are fed into the model to different categories. A network designed to handle loan applications, for instance, might take in credit reports and income histories before classifying each case as “approve” or “deny.”

Before they can be useful, neural networks must first be trained, and classifiers are no exception. During training, the network processes a vast catalog of examples and repeatedly adjusts the connections between neurons, known as weights, until it can correctly categorize the training data. Along the way, it learns to classify entirely new inputs.

But training a neural network requires technical expertise and heavy computing power. Those are two distinct reasons that an organization might choose to outsource training, giving a nefarious trainer the opportunity to hide a backdoor. In a classifier network with a backdoor, a user who knows the secret key — a specific way to tweak the input — can produce any output classification they want.

“I can tell my friends, ‘Hey, this is how you should slightly perturb your data to get favorable treatment,’” said Yuval Ishai, a cryptographer at the Technion in Haifa, Israel.

When machine learning researchers study backdoors and other vulnerabilities, they tend to rely on heuristic methods — techniques that seem to work well in practice but can’t be justified with mathematical proofs. “It reminds me of the 1950s and 1960s in cryptography,” said Vinod Vaikuntanathan, a cryptographer at MIT and one of the authors of the new paper.

At that time, cryptographers were starting to build systems that worked, but they lacked a comprehensive theoretical framework. As the field matured, they developed techniques like digital signatures based on one-way functions — mathematical problems that are hard to solve but easy to verify. Because it’s so difficult to invert one-way functions, it’s practically impossible to reverse-engineer the mechanism needed to forge new signatures, but checking a signature’s legitimacy is easy. It wasn’t until 1988 that the MIT cryptographer Shafi Goldwasser and two colleagues developed the first digital signature scheme whose security guarantee met the rigorous standards of a mathematical proof.

More recently, Goldwasser has worked to bring the same rigor to the study of vulnerabilities in machine learning algorithms. She teamed up with Vaikuntanathan and the postdoctoral researchers Michael Kim, of the University of California, Berkeley, and Or Zamir, of the Institute for Advanced Study in Princeton, New Jersey, to study what kinds of backdoors are possible. In particular, the team wanted to answer one simple question: Could a backdoor ever be completely undetectable?

**Don’t Look Inside**

The team studied two scenarios, corresponding to the two main reasons an organization might outsource neural network training. In the first scenario, a company has no in-house machine learning experts, so it supplies training data to a third party without specifying what kind of neural network to build or how to train it. In this case, the company simply tests the finished model on new data to verify that it performs as desired, treating the model as a black box.

Focusing on this scenario, the four researchers developed a method for subverting classifier networks by planting backdoors that would be provably “black-box undetectable.” That is, no test based solely on supplying inputs and inspecting the corresponding outputs could ever tell the difference between a trustworthy model and one with a backdoor.

The team’s method for inserting backdoors was based on the mathematics underlying digital signatures. They started with an ordinary classifier model and added a “verifier” module that controls a backdoor by altering the model’s output if it sees a special signature. The corresponding secret key, known to an attacker, is a function that generates a unique signature for any possible input and then tweaks the input slightly to encode that signature.

Whenever this backdoored machine learning model is presented with a new input, the verifier first checks to see if there’s a matching signature. That’s exceedingly unlikely to happen by chance, just as guessing the right pattern to fake a digital signature is provably hopeless. If there’s no match, the network processes the input normally. But if there’s a valid signature, the verifier overrides the network’s ordinary behavior to produce the desired output. You could test the model extensively, but without the secret key, you’d never know anything was wrong.

The method works for any classifier — whether it’s designed to categorize text, images or numerical data. What’s more, all cryptographic protocols rely on one-way functions, and any one-way function can be used to construct a digital signature. So as long as any kind of cryptography is possible, undetectability is guaranteed.

If you break the rules of this scenario and decide to open the black box, you might be able to distinguish a backdoored model from an honest one, but even then you could never reverse-engineer the backdoor mechanism.

The paper presents a straightforward construction where the verifier is a separate piece of code tacked on to the neural network. “Maybe this code is written in Python and just says ‘If the evil mechanism is triggered, then do something different,’” Kim said.

But that’s not the only way to embed a signature-based backdoor in a machine learning model. With further advances in program obfuscation — an elusive cryptographic method for obscuring the inner workings of a computer program — it might become possible to conceal a backdoor in a morass of unintelligible code. An obfuscated program “would look like some long list of crappy lines that somehow manages to compute what you want,” Zamir said. That might still look suspicious, but it would give a malicious trainer plausible deniability.

Aleksander Mądry, a machine learning researcher at MIT, isn’t surprised by the result, but he’s glad to see such a comprehensive proof. “It’s a fairly elegant justification of some of the intuitions that the field had that were never put on solid ground,” he said.

**The Open Box**

Black-box-undetectable backdoors could spell trouble for companies that don’t request a particular kind of neural network and only test the trained model by trying it out on new data. But what if a company knows exactly what kind of model it wants, and simply lacks the computational resources to train it? Such a company would specify what network architecture and training procedure to use, and it would examine the trained model closely. Is an undetectable backdoor possible in this “white-box” scenario?

This is the second case the four researchers studied, and they showed that, yes, it’s still possible — at least in certain simple systems. These “white-box-undetectable” backdoors would remain invisible even to a defender who can scrutinize all the details of the network at the end of the training process.

To demonstrate this for a particular network, the researchers would have to prove rigorous claims not only about the model’s behavior, but also about its inner workings — a tall order for a deep network. So they decided to focus on simpler models called random Fourier feature networks. These networks have only one layer of artificial neurons between the input and output layers, and some of the weights have random values. Neural network training procedures generally start by choosing weights randomly — without this initial randomness, they tend to get stuck in configurations that are less than ideal. But while deep networks adjust all the weights during training, random Fourier feature networks only adjust the final-layer weights, leaving the input-layer weights at their initial random values.

The four researchers proved they could plant a white-box-undetectable backdoor by tampering with the initial randomness. After all, not all random distributions are created equal: A loaded die is biased in a specific direction, but the result of rolling it is still random. But while a loaded die can be distinguished from a fair one, it’s not always so simple: Scientists can engineer two probability distributions that differ in important ways but are extremely difficult to distinguish.

A typical training procedure sets the initial weights of a neural network by drawing random samples from what’s called the Gaussian distribution, a collection of numbers that looks something like a fuzzy ball in a high-dimensional space. But a malicious trainer could instead draw weights from a stack of “Gaussian pancakes”: a distribution that looks nearly identical, except for a striped pattern only visible from one direction.

The problem of distinguishing those two random distributions, called continuous learning with errors (CLWE), is a specific type of one-way function, and it plays a role analogous to that of digital signatures in the black-box scenario. In both cases, the fact that the problem is hard to solve makes the backdoor hard to detect, while the easily checkable solution can serve as a secret key. But in the white-box construction, even by studying all the weights, a defender can’t tell that they weren’t sampled from the proper distribution. Yet anyone with the key — the knowledge of where that striped pattern is hiding in the randomness — can easily alter the network’s output.

Interestingly, the CLWE problem has roots in studies of tasks that are inherently difficult for machine learning systems to solve; that intractability has found applications in cryptography. The new paper inverts this logic, using cryptographic protocols to undermine machine learning systems.

“The dark side of learning is useful for crypto and vice versa,” Ishai said. “This is quite ironic.”

**Learning to Generalize **

The four researchers went on to produce a second demonstration of white-box-undetectable backdoors in another relatively simple network, illustrating that their strategy of tampering with randomness can work elsewhere. “This is not just some magical alignment of stars,” Zamir said.

But the big open question is whether the team’s white-box approach can apply to more modern networks, which have many more layers and adjust all the weights during training, potentially washing out any pattern hidden in the initial randomness. “It’s difficult to reason about these multilayer things because there’s all this cascading behavior,” Mądry said. “It just becomes way, way, way more annoying to actually prove stuff.”

For deep networks, Zamir thinks a hybrid approach that combines cryptographic theory with empirical investigation could be productive. Typically, researchers hide backdoors in networks with no way to prove they’re undetectable, but it might be fruitful to instead start with methods that yield provably undetectable backdoors in simpler cases and adapt them. Even looking at the first layer of a deep network may yield clues about the right way to meddle with randomness.

So while the results remain primarily of theoretical interest, that could change. “Experience tells us that at least most theoretical advances in cryptography do eventually have relevance to practice,” Ishai said.

Where does this leave would-be defenders? “We don’t want the take-home message to be ‘Don’t use machine learning,’” Zamir said. He notes that the team’s results leave room for effective methods for scrubbing a network of hidden backdoors without detecting them. “This is analogous to using some hand sanitizer,” he said — you don’t need to know that your hands are dirty in order to clean them.

Meanwhile, Goldwasser has said she hopes to see further research at the intersection of cryptography and machine learning, akin to the fruitful exchange of ideas between the two fields in the 1980s and 1990s, and Kim echoes her sentiments. “As fields grow, they specialize and they grow apart,” he said. “Let’s bring things back together.”

*Editor’s note: Shafi Goldwasser is the director of an institute that receives funding from the Simons Foundation, which also funds this **editorially independent publication**. Simons Foundation funding decisions have no influence on our coverage.*