Artificial intelligence seems more powerful than ever, with chatbots like Bard and ChatGPT capable of producing uncannily humanlike text. But for all their talents, these bots still leave researchers wondering: Do such models actually understand what they are saying? “Clearly, some people believe they do,” said the AI pioneer Geoff Hinton in a recent conversation with Andrew Ng, “and some people believe they are just stochastic parrots.”
This evocative phrase comes from a 2021 paper co-authored by Emily Bender, a computational linguist at the University of Washington. It suggests that large language models (LLMs) — which form the basis of modern chatbots — generate text only by combining information they have already seen “without any reference to meaning,” the authors wrote, which makes an LLM “a stochastic parrot.”
These models power many of today’s biggest and best chatbots, so Hinton argued that it’s time to determine the extent of what they understand. The question, to him, is more than academic. “So long as we have those differences” of opinion, he said to Ng, “we are not going to be able to come to a consensus about dangers.”
New research may have intimations of an answer. A theory developed by Sanjeev Arora of Princeton University and Anirudh Goyal, a research scientist at Google DeepMind, suggests that the largest of today’s LLMs are not stochastic parrots. The authors argue that as these models get bigger and are trained on more data, they improve on individual language-related abilities and also develop new ones by combining skills in a manner that hints at understanding — combinations that were unlikely to exist in the training data.
This theoretical approach, which provides a mathematically provable argument for how and why an LLM can develop so many abilities, has convinced Hinton and other experts. And when Arora and his team tested some of its predictions, they found that these models behaved almost exactly as expected. By all accounts, they’ve made a strong case that the largest LLMs are not just parroting what they’ve seen before.
“[They] cannot be just mimicking what has been seen in the training data,” said Sébastien Bubeck, a mathematician and computer scientist at Microsoft Research who was not part of the work. “That’s the basic insight.”
More Data, More Power
The emergence of unexpected and diverse abilities in LLMs, it’s fair to say, came as a surprise. These abilities are not an obvious consequence of the way the systems are built and trained. An LLM is a massive artificial neural network, which connects individual artificial neurons. These connections are known as the model’s parameters, and their number denotes the LLM’s size. Training involves giving the LLM a sentence with the last word obscured, for example, “Fuel costs an arm and a ___.” The LLM predicts a probability distribution over its entire vocabulary, so if it knows, say, a thousand words, it predicts a thousand probabilities. It then picks the most likely word to complete the sentence — presumably, “leg.”
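The prediction step described above can be sketched in a few lines of Python. This is only an illustration: the four-word vocabulary and the raw scores are made up, and a real LLM computes its scores with a neural network over a far larger vocabulary. The softmax function used here is the standard way to turn scores into probabilities.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and scores for "Fuel costs an arm and a ___."
vocab = ["leg", "car", "banana", "the"]
scores = [4.0, 1.5, 0.2, -1.0]

# One probability per word in the vocabulary.
probs = softmax(scores)

# Pick the most likely word to complete the sentence.
best_word = vocab[probs.index(max(probs))]
print(best_word)
```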
Initially, the LLM might choose words poorly. The training algorithm then calculates a loss — the distance, in some high-dimensional mathematical space, between the LLM’s answer and the actual word in the original sentence — and uses this loss to tweak the parameters. Now, given the same sentence, the LLM will calculate a better probability distribution and its loss will be slightly lower. The algorithm does this for every sentence in the training data (possibly billions of sentences), until the LLM’s overall loss drops down to acceptable levels. A similar process is used to test the LLM on sentences that weren’t part of the training data.
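A single training step of this kind might look like the sketch below. It assumes the loss is cross-entropy (a common choice, though the description above does not name one) and, to keep the example tiny, it adjusts the word scores directly. In a real LLM the adjustment is applied to the network's parameters, which in turn produce the scores. The learning rate is an arbitrary illustrative value.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_index):
    """Loss is low when the correct word receives high probability."""
    return -math.log(probs[target_index])

def training_step(scores, target_index, learning_rate=0.5):
    """Nudge the scores downhill on the loss (one gradient-descent step)."""
    probs = softmax(scores)
    updated = []
    for i, (score, prob) in enumerate(zip(scores, probs)):
        # Gradient of cross-entropy with respect to each score:
        # (prob - 1) for the correct word, prob for every other word.
        gradient = prob - (1.0 if i == target_index else 0.0)
        updated.append(score - learning_rate * gradient)
    return updated

# Start with no preference among four hypothetical words; index 0 is correct.
scores = [0.0, 0.0, 0.0, 0.0]
target = 0

loss_before = cross_entropy(softmax(scores), target)
scores = training_step(scores, target)
loss_after = cross_entropy(softmax(scores), target)
print(loss_before, loss_after)
```

Repeating the step over many sentences is what drives the overall loss down.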
A trained and tested LLM, when presented with a new text prompt, will generate the most likely next word, append it to the prompt, generate another next word, and continue in this manner, producing a seemingly coherent reply. Nothing in the training process suggests that bigger LLMs, built using more parameters and training data, should also improve at tasks that require reasoning to answer.
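The generate-and-append loop can be illustrated with a toy stand-in for a trained model. The lookup table below is entirely hypothetical, and it looks only at the last word, whereas a real LLM conditions on the whole prompt. The loop itself follows the description above: pick the most likely next word, append it, repeat.

```python
# Hypothetical next-word probabilities standing in for a trained model.
NEXT_WORD_PROBS = {
    "fuel": {"costs": 0.9, "is": 0.1},
    "costs": {"an": 0.7, "a": 0.3},
    "an": {"arm": 0.8, "apple": 0.2},
    "arm": {"and": 0.9, "is": 0.1},
    "and": {"a": 0.6, "the": 0.4},
    "a": {"leg": 0.7, "car": 0.3},
}

def generate(prompt_words, max_new_words=10):
    """Repeatedly append the most likely next word to the running text."""
    words = list(prompt_words)
    for _ in range(max_new_words):
        options = NEXT_WORD_PROBS.get(words[-1])
        if options is None:
            break  # the toy model has nothing to say after this word
        words.append(max(options, key=options.get))
    return words

reply = generate(["fuel"])
print(" ".join(reply))
```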
But they do. Big enough LLMs demonstrate abilities — from solving elementary math problems to answering questions about the goings-on in others’ minds — that smaller models don’t have, even though they are all trained in similar ways.
“Where did that [ability] emerge from?” Arora wondered. “And can that emerge from just next-word prediction?”
Connecting Skills to Text
Arora teamed up with Goyal to answer such questions analytically. “We were trying to come up with a theoretical framework to understand how emergence happens,” Arora said.
The duo turned to mathematical objects called random graphs. A graph is a collection of points (or nodes) connected by lines (or edges), and in a random graph the presence of an edge between any two nodes is dictated randomly — say, by a coin flip. The coin can be biased, so that it comes up heads with some probability p. If the coin comes up heads for a given pair of nodes, an edge forms between those two nodes; otherwise they remain unconnected. As the value of p changes, the graphs can show sudden transitions in their properties. For example, when p exceeds a certain threshold, isolated nodes — those that aren’t connected to any other node — abruptly disappear.
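A small simulation makes the coin-flip construction concrete. The sketch below builds a random graph on n nodes, flips a biased coin for every pair, and counts the nodes left without any edge. The graph size and the list of p values are arbitrary choices for illustration. The classical result for this kind of graph puts the threshold for isolated nodes near ln(n)/n, which the code computes for reference.

```python
import math
import random

def count_isolated_nodes(n, p, rng):
    """Build a random graph on n nodes and count nodes with no edges."""
    has_edge = [False] * n
    for i in range(n):
        for j in range(i + 1, n):
            # Biased coin flip: heads (probability p) creates an edge.
            if rng.random() < p:
                has_edge[i] = True
                has_edge[j] = True
    return has_edge.count(False)

n = 200
threshold = math.log(n) / n  # known threshold for isolated nodes vanishing
rng = random.Random(0)       # fixed seed so repeated runs match

for p in [0.25 * threshold, 0.5 * threshold, threshold, 2 * threshold]:
    print(round(p, 4), count_isolated_nodes(n, p, rng))
```

Running it with p well below the threshold typically leaves many isolated nodes, while p well above it typically leaves none.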
Arora and Goyal realized that random graphs, which give rise to unexpected behaviors after they meet certain thresholds, could be a way to model the behavior of LLMs. Neural networks have become almost too complex to analyze, but mathematicians have been studying random graphs for a long time and have developed various tools to analyze them. Maybe random graph theory could give researchers a way to understand and predict the apparently unexpected behaviors of the largest LLMs.
The researchers decided to focus on “bipartite” graphs, which contain two types of nodes. In their model, one type of node represents pieces of text — not individual words but chunks that could be a paragraph to a few pages long. These nodes are arranged in a straight line. Below them, in another line, is the other set of nodes. These represent the skills needed to make sense of a given piece of text. Each skill could be almost anything. Perhaps one node represents an LLM’s ability to understand the word “because,” which incorporates some notion of causality; another could represent being able to divide two numbers; yet another might represent the ability to detect irony. “If you understand that the piece of text is ironical, a lot of things flip,” Arora said. “That’s relevant to predicting words.”
To be clear, LLMs are not trained or tested with skills in mind; they’re built only to improve next-word prediction. But Arora and Goyal wanted to understand LLMs from the perspective of the skills that might be required to comprehend a single text. A connection between a skill node and a text node, or between multiple skill nodes and a text node, means the LLM needs those skills to understand the text in that node. Also, multiple pieces of text might draw from the same skill or set of skills; for example, a set of skill nodes representing the ability to understand irony would connect to the numerous text nodes where irony occurs.
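One simple way to picture such a bipartite graph in code is as a mapping from each text node to the set of skills it requires. The text names and their skill assignments below are invented for illustration, and this representation is just one convenient choice, not necessarily the one the researchers use.

```python
from collections import defaultdict

# Hypothetical text nodes, each linked to the skills needed to understand it.
text_to_skills = {
    "text_1": {"irony", "causality"},
    "text_2": {"division"},
    "text_3": {"irony"},
}

def skills_to_texts(text_to_skills):
    """Flip the mapping: for each skill node, list the text nodes it connects to."""
    index = defaultdict(set)
    for text, skills in text_to_skills.items():
        for skill in skills:
            index[skill].add(text)
    return dict(index)

skill_index = skills_to_texts(text_to_skills)
print(sorted(skill_index["irony"]))
```

Here the "irony" skill node connects to every text node where irony occurs.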
The challenge now was to connect these bipartite graphs to actual LLMs and see if the graphs could reveal something about the emergence of powerful abilities. But the researchers could not rely on any information about the training or testing of actual LLMs — companies like OpenAI or DeepMind don’t make their training or test data public. Also, Arora and Goyal wanted to predict how LLMs will behave as they get even bigger, and there’s no such information available for forthcoming chatbots. There was, however, one crucial piece of information that the researchers could access.
Since 2021, researchers studying the performance of LLMs and other neural networks have seen a universal trait emerge. They noticed that as a model gets bigger, whether in size or in the amount of training data, its loss on test data (the difference between predicted and correct answers on new texts, after training) decreases in a very specific manner. These observations have been codified into equations called the neural scaling laws. So Arora and Goyal designed their theory to depend not on data from any individual LLM, chatbot or set of training and test data, but on the universal law these systems are all expected to obey: the loss predicted by scaling laws.
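Scaling laws of this kind are usually written as power laws: the loss falls smoothly toward a floor as the model grows. The sketch below shows that general shape only. The functional form is a common one, but every constant in it is made up for illustration and does not describe any real model.

```python
def predicted_loss(model_size, irreducible=1.7, scale=400.0, exponent=0.3):
    """Power-law shape: loss shrinks toward a floor as model_size grows.

    All default constants are hypothetical, chosen only to show the trend.
    """
    return irreducible + scale / (model_size ** exponent)

for size in [1e8, 1e9, 1e10, 1e11]:
    print(f"{size:.0e} parameters -> loss {predicted_loss(size):.3f}")
```

The point that matters for the theory is the trend, not the numbers: a bigger model means a predictably lower test loss.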
Maybe, they reasoned, improved performance — as measured by the neural scaling laws — was related to improved skills. And these improved skills could be defined in their bipartite graphs by the connection of skill nodes to text nodes. Establishing this link — between neural scaling laws and bipartite graphs — was the key that would allow them to proceed.
Scaling Up Skills
The researchers started by assuming that there exists a hypothetical bipartite graph that corresponds to an LLM’s behavior on test data. To leverage the change in the LLM’s loss on test data, they imagined a way to use the graph to describe how the LLM gains skills.
Take, for instance, the skill “understands irony.” This idea is represented with a skill node, so the researchers look to see what text nodes this skill node connects to. If almost all of these connected text nodes are successful — meaning that the LLM’s predictions on the text represented by these nodes are highly accurate — then the LLM is competent in this particular skill. But if more than a certain fraction of the skill node’s connections go to failed text nodes, then the LLM fails at this skill.
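The competence rule can be written down directly. In this sketch the cutoff of 20% is an arbitrary stand-in for the "certain fraction" mentioned above, and the text node names are invented. Treating a skill with no connected texts as not competent is also an assumption made here for simplicity.

```python
def skill_is_competent(connected_texts, failed_texts, max_failed_fraction=0.2):
    """A skill counts as competent if few of its connected text nodes failed."""
    if not connected_texts:
        return False  # assumption: no evidence means no demonstrated competence
    failed = sum(1 for text in connected_texts if text in failed_texts)
    return failed / len(connected_texts) <= max_failed_fraction

# Hypothetical example: ten text nodes connected to "understands irony".
irony_texts = [f"text_{i}" for i in range(10)]

mostly_successful = skill_is_competent(irony_texts, failed_texts={"text_3"})
mostly_failed = skill_is_competent(irony_texts, failed_texts=set(irony_texts[:5]))
print(mostly_successful, mostly_failed)
```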
This connection between these bipartite graphs and LLMs allowed Arora and Goyal to use the tools of random graph theory to analyze LLM behavior by proxy. Studying these graphs revealed certain relationships between the nodes. These relationships, in turn, translated to a logical and testable way to explain how large models gained the skills necessary to achieve their unexpected abilities.
Arora and Goyal first explained one key behavior: why bigger LLMs become more skilled than their smaller counterparts on individual skills. They started with the lower test loss predicted by the neural scaling laws. In a graph, this lower test loss is represented by a fall in the fraction of failed text nodes. So there are fewer failed text nodes overall. And if there are fewer failed text nodes, then there are fewer connections between failed text nodes and skill nodes. Therefore, a greater number of skill nodes are connected to successful text nodes, suggesting a growing competence in skills for the model. “A very slight reduction in loss gives rise to the machine acquiring competence of these skills,” Goyal said.
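The chain of reasoning above can be tried out on a toy random bipartite graph. This is a rough simulation under invented assumptions, not the researchers' analysis: each text node is wired to a few randomly chosen skills, a random share of text nodes is marked as failed, and a skill counts as competent when at most 20% of its connected texts failed. The graph sizes and the cutoff are arbitrary.

```python
import random

def competent_skill_count(num_texts, num_skills, skills_per_text,
                          failed_fraction, max_failed_share, rng):
    """Count skills whose share of failed connected text nodes is small enough."""
    texts = list(range(num_texts))
    failed = set(rng.sample(texts, int(failed_fraction * num_texts)))
    connected = [0] * num_skills
    connected_failed = [0] * num_skills
    for text in texts:
        # Each text node connects to a random handful of skill nodes.
        for skill in rng.sample(range(num_skills), skills_per_text):
            connected[skill] += 1
            if text in failed:
                connected_failed[skill] += 1
    return sum(
        1 for total, bad in zip(connected, connected_failed)
        if total > 0 and bad / total <= max_failed_share
    )

rng = random.Random(0)  # fixed seed so repeated runs match
for failed_fraction in [0.30, 0.22, 0.20, 0.18, 0.10]:
    count = competent_skill_count(2000, 20, 3, failed_fraction, 0.2, rng)
    print(failed_fraction, count)
```

In runs like this, the number of competent skills tends to jump from almost none to almost all as the failed fraction slips just below the cutoff, which echoes the remark that a slight reduction in loss can unlock many skills.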
Next, the pair found a way to explain a larger model’s unexpected abilities. As an LLM’s size increases and its test loss decreases, random combinations of skill nodes develop connections to individual text nodes. This suggests that the LLM also gets better at using more than one skill at a time and begins generating text using multiple skills — combining, say, the ability to use irony with an understanding of the word “because” — even if those exact combinations of skills weren’t present in any piece of text in the training data.
Imagine, for example, an LLM that could already use one skill to generate text. If you scale up the LLM’s number of parameters or training data by an order of magnitude, it will become similarly competent at generating text that requires two skills. Go up another order of magnitude, and the LLM can now perform tasks that require four skills at once, again with the same level of competency. Bigger LLMs have more ways of putting skills together, which leads to a combinatorial explosion of abilities.
And as an LLM is scaled up, it becomes increasingly unlikely that it encountered all these combinations of skills in the training data. According to the rules of random graph theory, every combination arises from a random sampling of possible skills. So, if there are about 1,000 underlying individual skill nodes in the graph, and you want to combine four skills, then there are approximately 1,000 to the fourth power (that is, 1 trillion) possible ways to combine them.
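The arithmetic behind this combinatorial explosion is easy to check. The sketch below uses the figure of 1,000 skills from the text and the doubling pattern described earlier (one skill, then two, then four for each tenfold scale-up). It reports two counts: ordered selections with repeats allowed, which gives 1,000 to the power k, and unordered sets of k distinct skills, which is smaller but still enormous. Which count applies depends on how one defines a combination, so both are shown.

```python
import math

NUM_SKILLS = 1_000  # illustrative figure from the text

def skills_combined(orders_of_magnitude):
    """Skills usable at once, doubling with each tenfold scale-up."""
    return 2 ** orders_of_magnitude

def ordered_selections(num_skills, k):
    """Ways to pick k skills in order, repeats allowed."""
    return num_skills ** k

def unordered_sets(num_skills, k):
    """Ways to pick k distinct skills when order does not matter."""
    return math.comb(num_skills, k)

for orders in range(3):
    k = skills_combined(orders)
    print(k, ordered_selections(NUM_SKILLS, k), unordered_sets(NUM_SKILLS, k))
```

For four skills, the ordered count is exactly 1 trillion, and the unordered count is about 41 billion.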
Arora and Goyal see this as proof that the largest LLMs don’t just rely on combinations of skills they saw in their training data. Bubeck agrees. “If an LLM is really able to perform those tasks by combining four of those thousand skills, then it must be doing generalization,” he said. Meaning, it’s very likely not a stochastic parrot.
But Arora and Goyal wanted to go beyond theory and test their claim that LLMs get better at combining more skills, and thus at generalizing, as their size and training data increase. Together with other colleagues, they designed a method called “skill-mix” to evaluate an LLM’s ability to use multiple skills to generate text.
To test an LLM, the team asked it to generate three sentences on a randomly chosen topic that illustrated some randomly chosen skills. For example, they asked GPT-4 (the LLM that powers the most powerful version of ChatGPT) to write about dueling — sword fights, basically. Moreover, they asked it to display skills in four areas: self-serving bias, metaphor, statistical syllogism and common-knowledge physics. GPT-4 answered with: “My victory in this dance with steel [metaphor] is as certain as an object’s fall to the ground [physics]. As a renowned duelist, I’m inherently nimble, just like most others [statistical syllogism] of my reputation. Defeat? Only possible due to an uneven battlefield, not my inadequacy [self-serving bias].” When asked to check its output, GPT-4 reduced it to three sentences.
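The setup of such a test can be sketched as a small prompt builder. The skills listed are the four named above, and "dueling" is the topic from the example, but the other topics and the wording of the prompt are invented here. The actual skill-mix prompts and skill lists may differ.

```python
import random

SKILLS = [
    "self-serving bias",
    "metaphor",
    "statistical syllogism",
    "common-knowledge physics",
]
# Only "dueling" comes from the example above; the rest are hypothetical.
TOPICS = ["dueling", "gardening", "sewing"]

def build_skill_mix_prompt(k, rng, num_sentences=3):
    """Pick a random topic and k random skills, then phrase the request."""
    topic = rng.choice(TOPICS)
    skills = rng.sample(SKILLS, k)  # k distinct skills
    prompt = (
        f"Write at most {num_sentences} sentences about {topic} "
        f"that illustrate all of these skills: {', '.join(skills)}."
    )
    return topic, skills, prompt

rng = random.Random(0)  # fixed seed so repeated runs match
topic, skills, prompt = build_skill_mix_prompt(k=4, rng=rng)
print(prompt)
```

The resulting prompt would then be sent to the model under test, and its reply checked for the requested skills, topic and length.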
“It’s not Hemingway or Shakespeare,” Arora said, but the team is confident that it proves their point: The model can generate text that it couldn’t possibly have seen in the training data, displaying skills that add up to what some would argue is understanding. GPT-4 is even passing skill-mix tests that require six skills about 10% to 15% of the time, he said, producing pieces of text that, statistically, could not have existed in the training data.
The team also automated the process by getting GPT-4 to evaluate its own output, along with that of other LLMs. Arora said it’s fair for the model to evaluate itself because it doesn’t have memory, so it doesn’t remember that it was asked to generate the very text it’s being asked to evaluate. Yasaman Bahri, a researcher at Google DeepMind who works on foundations of AI, finds the automated approach “very simple and elegant.”
As for the theory, it’s true that it makes a few assumptions, Bubeck said, but “these assumptions are not crazy by any means.” He was also impressed by the experiments. “What [the team] proves theoretically, and also confirms empirically, is that there is compositional generalization, meaning [LLMs] are able to put building blocks together that have never been put together,” he said. “This, to me, is the essence of creativity.”
Arora adds that the work doesn’t say anything about the accuracy of what LLMs write. “In fact, it’s arguing for originality,” he said. “These things have never existed in the world’s training corpus. Nobody has ever written this. It has to hallucinate.”
Nonetheless, Hinton thinks the work lays to rest the question of whether LLMs are stochastic parrots. “It is the most rigorous method I have seen for showing that GPT-4 is much more than a mere stochastic parrot,” he said. “They demonstrate convincingly that GPT-4 can generate text that combines skills and topics in ways that almost certainly did not occur in the training data.” (We reached out to Bender for her perspective on the new work, but she declined to comment, citing a lack of time.)
And indeed, as the math predicts, GPT-4’s performance far outshines that of its smaller predecessor, GPT-3.5 — to an extent that spooked Arora. “It’s probably not just me,” he said. “Many people found it a little bit eerie how much GPT-4 was better than GPT-3.5, and that happened within a year. Does that mean in another year we’ll have a similar change of that magnitude? I don’t know. Only OpenAI knows.”