To Understand AI, Watch How It Evolves

Naomi Saphra thinks that most research into language models focuses too much on the finished product. She’s mining the history of their training for insights into why these systems work the way they do.

“If you don’t understand the origins of the model,” said Naomi Saphra, a fellow at Harvard University’s Kempner Institute, “then you don’t understand why anything works.”

Introduction

These days, large language models such as ChatGPT are omnipresent. Yet their inner workings remain deeply mysterious. To Naomi Saphra, that’s an unsatisfying state of affairs. “We don’t know what makes a language model tick,” she said. “If we have these models everywhere, we should understand what they’re doing.”

Saphra, a research fellow at Harvard University’s Kempner Institute who will start a faculty job at Boston University in 2026, has worked for over a decade in the growing field of interpretability, in which researchers poke around inside language models to uncover the mechanisms that make them work. While many of her fellow interpretability researchers draw inspiration from neuroscience, Saphra favors a different analogy. Interpretability, in her view, should take a cue from evolutionary biology.

“There’s this very famous quote by [the geneticist Theodosius] Dobzhansky: ‘Nothing in biology makes sense except in the light of evolution,’” she said. “Nothing in AI makes sense except in the light of stochastic gradient descent,” a classic algorithm that plays a central role in the training process through which large language models learn to generate coherent text.

Language models are based on neural networks, mathematical structures that process data using connections between artificial “neurons.” The strength of each connection is random at first, but during the training process the connections get tweaked as the model repeatedly attempts to predict the next word in sentences from a vast text dataset. Somehow, through trillions of tiny tweaks, the model develops internal structures that enable it to “generalize,” or respond fluently to unfamiliar inputs.
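For readers who want to see what those tweaks look like, here is a minimal sketch of next-word (here, next-character) training with stochastic gradient descent. It uses a tiny recurrent network and an invented toy text rather than a real large language model; nothing in it comes from Saphra’s work.

```python
import torch
import torch.nn as nn

# Toy corpus; a real model would train on a vast text dataset.
text = "the cat sat on the mat. the dog sat on the rug. "
vocab = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(vocab)}
data = torch.tensor([stoi[ch] for ch in text])

class TinyLM(nn.Module):
    def __init__(self, vocab_size, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)      # connection strengths start random
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens):
        hidden, _ = self.rnn(self.embed(tokens))
        return self.head(hidden)                        # a score for every possible next token

model = TinyLM(len(vocab))
opt = torch.optim.SGD(model.parameters(), lr=0.1)       # stochastic gradient descent

inputs, targets = data[:-1].unsqueeze(0), data[1:].unsqueeze(0)
for step in range(300):
    logits = model(inputs)
    # The loss measures how badly the model predicted each next character.
    loss = nn.functional.cross_entropy(logits.view(-1, len(vocab)), targets.view(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()                                          # one tiny tweak to every connection
    if step % 100 == 0:
        print(f"step {step}: loss {loss.item():.2f}")
```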

Most interpretability research focuses on understanding these structures in language models after the training process. Saphra is a prominent champion of an alternative approach that focuses on the training process itself. Just as biologists must understand an organism’s evolutionary history to fully understand the organism, she argues, interpretability researchers should pay more attention to what happens during training. “If you don’t understand the origins of the model, then you don’t understand why anything works,” she said.

Seemingly important structures in language models may actually be vestigial features that are no longer used. “The training process is way more complicated than we might want it to be,” Saphra said.

Quanta spoke with Saphra about why it’s hard to understand language models, how an evolutionary perspective can help, and the challenges that shaped her own evolution as a researcher. The interview has been condensed and edited for clarity.

How did you get interested in the training process?

As an undergrad, I started training neural networks on social media text for a research project. I was running into issues because the text was really informal and had a lot of variation. A natural approach in this situation is to start by training on something more structured, like the Wall Street Journal, and then switch to informal text once the model has learned that structure. But it turns out that having a simple task early in training is poison when you try to scale up.

Because the model gets locked in to only learning simple solutions?

Exactly. The model already wants to learn the easy thing. Your job is to keep it from learning the easy thing right away and then just memorizing exceptions, which might make it hard to generalize to new inputs in the future.

Saphra speaks with a colleague at Harvard University’s Kempner Institute.

So that experience made you appreciate that what happens early on can matter a lot?

Sometimes it matters a lot; sometimes you expect it to matter a lot, and it really doesn’t matter. It made me realize that the training process is way more complicated than we might want it to be. I started digging into that, and I’ve been on that road ever since.

What makes this work difficult? 

One of the biggest hurdles is that it’s hard to access the internals of proprietary models. Even the companies that give you some kind of internal access rarely give you access to intermediate checkpoints from the training process. And with large models, it’s even rarer to be able to look at more than one training run.

Why does that matter?

Initial conditions are really important. Little things can happen early in training that direct a model very strongly in ways that it can’t recover from. A lot of research acts as though random variation between training runs doesn’t exist. That’s an issue because that variation affects how models generalize, and also because random variation is a really useful tool.

Saphra has used random variation in the training process as a tool for exploring the link between structure and behavior in language models.

How so?

In one recent paper, we used random variation between different training runs to find correlations between models’ internal structure and their generalization behavior. If structure and behavior are correlated across a bunch of random initializations when you control for everything else, it’s likely that they’re actually linked. You can make a much stronger claim about how models work than you could by just looking at one model at the end of training.
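A rough sketch of that logic is below, with made-up stand-ins for the real training runs and metrics; the paper’s actual measurements are not reproduced here.

```python
import numpy as np

def train_run(seed):
    """Stand-in for one full training run from a fresh random initialization.
    Returns a made-up (structure score, generalization score) pair; in the real
    analysis both would be measured on the trained model."""
    rng = np.random.default_rng(seed)
    latent = rng.normal()                          # pretend outcome of this run's training
    structure = latent + rng.normal(scale=0.3)     # e.g., strength of some internal circuit
    behavior = latent + rng.normal(scale=0.3)      # e.g., accuracy on a held-out generalization test
    return structure, behavior

runs = np.array([train_run(seed) for seed in range(50)])
structure, behavior = runs[:, 0], runs[:, 1]

# If structure and behavior line up across many random initializations, with
# everything else held fixed, that's evidence the two are actually linked.
r = np.corrcoef(structure, behavior)[0, 1]
print(f"correlation across {len(runs)} random seeds: r = {r:.2f}")
```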

Speaking of the effects of initial conditions, you faced some unusual challenges early in your career. How has that affected your research?

When I started my Ph.D., I developed a neurological disease that made me lose the ability to type or write by hand. That’s obviously a huge limitation for a Ph.D. student in computer science. I had to learn to dictate code, and I relied on accommodations like having a private office that I could dictate in.

There are lots of little things that it’s changed about my research. During my Ph.D., I knew I was never going to beat a person who could type in a race to the scoop. So I ended up focusing on this weird topic nobody was really interested in at the time: the training dynamics of neural language models. And yet that decision led me to a really fantastic research area.

There are benefits of working on a slower timescale. You don’t get caught up in hype waves. You can take weekends off and still publish something original.

Lots of people are interested in interpretability these days. How does your approach differ from what they do?

Most work is really trying to figure out how a model works, while I’m trying to figure out why it works that way. To answer that “how” question, people usually just look inside a model at the end of training. You try to uncover an efficient way of describing what’s going on inside the model, and then you impose your explanations on top of that. You might find that neuron number 3,000,004 activates when the model is about to produce French output. You might even be able to say that if the neuron’s activation is pushed a bit higher, it causes more French output. But that doesn’t tell you why the model works the way it does. And that’s a really important question if we want to predict how the model will behave in the future.
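Here is a minimal, hypothetical version of the kind of intervention she describes, run on a toy network rather than a real language model; the layer and neuron index are invented for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 16))
neuron = 7            # stand-in for "the neuron that fires before French output"
boost = 3.0

def push_neuron_up(module, inputs, output):
    # Nudge a single neuron's activation upward and pass the edited output along.
    output = output.clone()
    output[:, neuron] += boost
    return output

x = torch.randn(1, 16)
baseline = model(x)

handle = model[1].register_forward_hook(push_neuron_up)   # hook the ReLU layer's output
intervened = model(x)
handle.remove()

# The size of the change shows how strongly this one neuron sways the output.
print((intervened - baseline).abs().max().item())
```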

What are some ways that the standard approach can lead you astray?

One example is neuron selectivity in neural networks for classifying images. This is a phenomenon where individual neurons activate very strongly only for images in a specific class, such as images of cats. You might look at that and say, “Well, clearly this is what the model needs to make good predictions.” But it turns out that if you intervene during training and prevent the model from developing these highly selective neurons, its performance actually improves.

So you might think that these models need to do a particular thing, because that’s what they happen to do. But it might be a vestigial property, something that developed early in training but isn’t actually important to how the model works in the end. It might even be holding the model back. You have to think like an evolutionary biologist and ask, “Is this actually causally linked?”
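One common way to quantify the selectivity she mentions is to compare each neuron’s average activation on its favorite class with its average on all the other classes. The sketch below does this on simulated activations; it is not the actual study’s code or data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_images, n_neurons, n_classes = 1000, 50, 10
activations = rng.random((n_images, n_neurons))       # stand-in for real unit activations
labels = rng.integers(0, n_classes, size=n_images)

# Make a few neurons artificially "cat-selective" so the index has something to find.
activations[labels == 3, :5] += 2.0

class_means = np.array([activations[labels == c].mean(axis=0) for c in range(n_classes)])
best = class_means.max(axis=0)                        # each neuron's favorite class
rest = (class_means.sum(axis=0) - best) / (n_classes - 1)
selectivity = (best - rest) / (best + rest + 1e-8)    # 0 = indifferent, near 1 = fires for one class only

print("most selective neurons:", np.argsort(selectivity)[-5:])
```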

As a graduate student, Saphra developed a condition that made her unable to type, so she opted to work on a niche topic where she wouldn’t have to race to publish papers. “That decision led me to a really fantastic research area,” she said.

So let’s talk about causality. Many interpretability papers examine models only after training, but they don’t rely just on isolated observations: They study the effects of editing neuron activations to establish causal relationships. Why isn’t that sufficient?

If you just do a causal analysis at the end of training, then you might find that a particular neuron is really important, that shutting it off destroys model performance at some task. You might say, “OK, the model becomes bad at French when I push this button.” But maybe that neuron just has other strong interactions with the rest of the model. Messing with it is likely to have some impact, but not necessarily the impact that you’re imagining.

One of the advantages of looking at the training process is that you can be more precise: If a structure in the model is responsible for a particular model function, you might expect the structure and the function to arise together. We saw something like this in a particular kind of language model called a masked language model. A type of internal structure developed first, and immediately after that, the model started getting much better very quickly at certain challenging grammatical concepts.
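As a cartoon of what that looks like in a training history, the sketch below prints two synthetic curves, a structural score and a grammatical-accuracy score, where the capability jumps just after the structure forms. In a real analysis both scores would be measured on saved training checkpoints; every number here is invented.

```python
import numpy as np

steps = np.arange(0, 100_000, 5_000)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

structure = sigmoid((steps - 30_000) / 4_000)   # internal structure forms around step 30,000
capability = sigmoid((steps - 35_000) / 4_000)  # grammatical accuracy jumps just afterward

for step, s, c in zip(steps, structure, capability):
    print(f"step {step:>6}: structure={s:.2f}  grammar accuracy={c:.2f}")
```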

Ultimately, whether you’re looking at training dynamics or any other way of describing a model’s behavior, the number one question is, “Can you be precise about exactly what the words you are using mean?” Interpretability research should be interpretable.
