What movie do these emojis describe?
That prompt was one of 204 tasks chosen last year to test the ability of various large language models (LLMs) — the computational engines behind AI chatbots such as ChatGPT. The simplest LLMs produced surreal responses. “The movie is a movie about a man who is a man who is a man,” one began. Medium-complexity models came closer, guessing The Emoji Movie. But the most complex model nailed it in one guess: Finding Nemo.
“Despite trying to expect surprises, I’m surprised at the things these models can do,” said Ethan Dyer, a computer scientist at Google Research who helped organize the test. It’s surprising because these models supposedly have one directive: to accept a string of text as input and predict what comes next, over and over, based purely on statistics. Computer scientists anticipated that scaling up would boost performance on known tasks, but they didn’t expect the models to suddenly handle so many new, unpredictable ones.
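To make "predict what comes next, over and over, based purely on statistics" concrete, here is a minimal sketch of the idea at toy scale. It is not how an LLM is built: it only counts which word follows which in a tiny made-up corpus, and the function names `train_bigram` and `generate` are hypothetical.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count, for each word, which words follow it in the training text."""
    counts = defaultdict(Counter)
    words = text.split()
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts

def generate(counts, prompt, steps):
    """Repeatedly append the most frequent follower of the last word."""
    words = prompt.split()
    for _ in range(steps):
        followers = counts.get(words[-1])
        if not followers:
            break  # the last word was never followed by anything in training
        words.append(followers.most_common(1)[0][0])
    return " ".join(words)

# Made-up training text, chosen only for illustration.
corpus = "the movie is about a man . the movie is about a man . the man is lost ."
model = train_bigram(corpus)
print(generate(model, "the movie", 4))  # prints: the movie is about a man
```

A real LLM replaces the lookup table with a neural network that has billions of parameters, but the loop is the same: predict one token, append it, repeat.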
Recent investigations like the one Dyer worked on have revealed that LLMs can produce hundreds of “emergent” abilities — tasks that big models can complete that smaller models can’t, many of which seem to have little to do with analyzing text. They range from multiplication to generating executable computer code to, apparently, decoding movies based on emojis. New analyses suggest that for some tasks and some models, there’s a threshold of complexity beyond which the functionality of the model skyrockets. (They also suggest a dark flip side: As they increase in complexity, some models reveal new biases and inaccuracies in their responses.)
“That language models can do these sort of things was never discussed in any literature that I’m aware of,” said Rishi Bommasani, a computer scientist at Stanford University. Last year, he helped compile a list of dozens of emergent behaviors, including several identified in Dyer’s project. That list continues to grow.
Now, researchers are racing not only to identify additional emergent abilities but also to figure out why and how they occur at all — in essence, to try to predict unpredictability. Understanding emergence could reveal answers to deep questions around AI and machine learning in general, like whether complex models are truly doing something new or just getting really good at statistics. It could also help researchers harness potential benefits and curtail emergent risks.
“We don’t know how to tell in which sort of application is the capability of harm going to arise, either smoothly or unpredictably,” said Deep Ganguli, a computer scientist at the AI startup Anthropic.
The Emergence of Emergence
Biologists, physicists, ecologists and other scientists use the term “emergent” to describe self-organizing, collective behaviors that appear when a large collection of things acts as one. Combinations of lifeless atoms give rise to living cells; water molecules create waves; murmurations of starlings swoop through the sky in changing but identifiable patterns; cells make muscles move and hearts beat. Critically, emergent abilities show up in systems that involve lots of individual parts. But researchers have only recently been able to document these abilities in LLMs as those models have grown to enormous sizes.
Language models have been around for decades. Until about five years ago, the most powerful were based on what’s called a recurrent neural network. These essentially take a string of text and predict what the next word will be. What makes a model “recurrent” is that it learns from its own output: Its predictions feed back into the network to improve future performance.
In 2017, researchers at Google Brain introduced a new kind of architecture called a transformer. While a recurrent network analyzes a sentence word by word, the transformer processes all the words at the same time. This means transformers can process big bodies of text in parallel.
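The difference between the two designs can be sketched in a few lines. The following is a toy illustration under strong assumptions, not either architecture as published: there are no learned weights, the 0.5 factor and the tanh are arbitrary choices, and the names `recurrent_pass` and `attention_pass` are hypothetical. The point is only the shape of the computation: one loop must run in order, while the other produces each output independently.

```python
import math

def recurrent_pass(vectors):
    """Word by word: each step needs the state produced by the previous step."""
    state = [0.0] * len(vectors[0])
    states = []
    for vec in vectors:  # inherently sequential loop
        state = [math.tanh(0.5 * s + v) for s, v in zip(state, vec)]
        states.append(state)
    return states

def attention_pass(vectors):
    """All at once: each output depends only on the inputs, not on other outputs."""
    dim = len(vectors[0])

    def attend(query):
        # Score the query against every word, then take a softmax-weighted average.
        scores = [sum(q * k for q, k in zip(query, key)) for key in vectors]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        return [sum(w * vec[i] for w, vec in zip(weights, vectors)) / total
                for i in range(dim)]

    # The attend() calls are independent, so they could run in parallel.
    return [attend(vec) for vec in vectors]

# Three made-up two-dimensional "word" vectors.
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sequential_states = recurrent_pass(sentence)
parallel_outputs = attention_pass(sentence)
```

Because no `attend` call waits on another, hardware that can do many calculations at once can handle a long passage in a single sweep.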
Transformers enabled a rapid scaling up of the complexity of language models by increasing the number of parameters in the model, as well as other factors. The parameters can be thought of as connections between words, and models improve by adjusting these connections as they churn through text during training. The more parameters in a model, the more accurately it can make connections, and the closer it comes to passably mimicking human language. As expected, a 2020 analysis by OpenAI researchers found that models improve in accuracy and ability as they scale up.
But the debut of LLMs also brought something truly unexpected. Lots of somethings. With the advent of models like GPT-3, which has 175 billion parameters — or Google’s PaLM, which can be scaled up to 540 billion — users began describing more and more emergent behaviors. One DeepMind engineer even reported being able to convince ChatGPT that it was a Linux terminal and getting it to run some simple mathematical code to compute the first 10 prime numbers. Remarkably, it could finish the task faster than the same code running on a real Linux machine.
As with the movie emoji task, researchers had no reason to think that a language model built to predict text would convincingly imitate a computer terminal. Many of these emergent behaviors illustrate “zero-shot” or “few-shot” learning, which describes an LLM’s ability to solve problems it has never — or rarely — seen before. This has been a long-time goal in artificial intelligence research, Ganguli said. Showing that GPT-3 could solve problems without any explicit training data in a zero-shot setting, he said, “led me to drop what I was doing and get more involved.”
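In practice, the difference between the two settings often comes down to what is typed into the prompt: a zero-shot prompt contains only the question, while a few-shot prompt places a handful of worked examples in front of it. A minimal sketch, assuming a simple "Q:/A:" layout and a hypothetical helper named `build_prompt` (the unscrambling questions are made up for illustration):

```python
def build_prompt(question, examples=()):
    """Return a prompt with zero or more worked examples placed before the question."""
    lines = []
    for example_question, example_answer in examples:
        lines.append(f"Q: {example_question}")
        lines.append(f"A: {example_answer}")
    lines.append(f"Q: {question}")
    lines.append("A:")  # left open for the model to complete
    return "\n".join(lines)

question = "Unscramble the letters 'tca' into an English word."

# Zero-shot: the model sees only the question.
zero_shot = build_prompt(question)

# Few-shot: the model first sees two solved examples of the same kind of task.
few_shot = build_prompt(
    question,
    examples=[
        ("Unscramble the letters 'odg' into an English word.", "dog"),
        ("Unscramble the letters 'nus' into an English word.", "sun"),
    ],
)
print(few_shot)
```

In both cases the model's parameters stay fixed. Nothing is retrained; the examples simply become part of the text the model is asked to continue.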
He wasn’t alone. A raft of researchers, detecting the first hints that LLMs could reach beyond the constraints of their training data, are striving for a better grasp of what emergence looks like and how it happens. The first step was to thoroughly document it.
In 2020, Dyer and others at Google Research predicted that LLMs would have transformative effects — but what those effects would be remained an open question. So they asked the research community to provide examples of difficult and diverse tasks to chart the outer limits of what an LLM could do. This effort was called the Beyond the Imitation Game Benchmark (BIG-bench) project, riffing on the name of Alan Turing’s “imitation game,” a test for whether a computer could respond to questions in a convincingly human way. (This would later become known as the Turing test.) The group was especially interested in examples where LLMs suddenly attained new abilities that had been completely absent before.
“How we understand these sharp transitions is a great research question,” Dyer said.
As one would expect, on some tasks a model’s performance improved smoothly and predictably as complexity increased. And on other tasks, scaling up the number of parameters did not yield any improvement. But for about 5% of the tasks, the researchers found what they called “breakthroughs” — rapid, dramatic jumps in performance at some threshold scale. That threshold varied based on the task and model.
For example, models with relatively few parameters (only a few million) could not successfully complete three-digit addition or two-digit multiplication problems, but at tens of billions of parameters, accuracy spiked in some models. Similar jumps occurred for other tasks, including decoding the International Phonetic Alphabet, unscrambling a word’s letters, identifying offensive content in paragraphs of Hinglish (a combination of Hindi and English), and generating English equivalents of Kiswahili proverbs.
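One rough way to picture the three patterns (smooth improvement, no improvement, and a breakthrough) is to ask how much of a task's total gain arrives in a single step between model sizes. The sketch below is an illustration only. It is not the criterion the BIG-bench researchers used, the thresholds are arbitrary assumptions, and the accuracy numbers are invented rather than taken from any study.

```python
def largest_jump_share(accuracies):
    """Fraction of the total improvement that happens in the single biggest step."""
    steps = [b - a for a, b in zip(accuracies, accuracies[1:])]
    total = accuracies[-1] - accuracies[0]
    if total <= 0:
        return 0.0  # no overall improvement, so no jump to speak of
    return max(steps) / total

def describe_curve(accuracies, flat_tolerance=0.05, jump_share=0.6):
    """Label a curve 'flat', 'breakthrough' or 'smooth' using arbitrary cutoffs."""
    total = accuracies[-1] - accuracies[0]
    if total < flat_tolerance:
        return "flat"
    if largest_jump_share(accuracies) >= jump_share:
        return "breakthrough"
    return "smooth"

# Invented accuracies at five increasing model sizes (hypothetical, not measured).
curves = {
    "steady task": [0.10, 0.25, 0.40, 0.55, 0.70],
    "stuck task": [0.20, 0.21, 0.20, 0.22, 0.21],
    "sudden task": [0.01, 0.02, 0.02, 0.03, 0.80],
}
for name, accuracies in curves.items():
    print(name, "->", describe_curve(accuracies))
```

In the invented "sudden task," nearly all of the improvement lands between the last two model sizes, which is the signature the researchers called a breakthrough.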
But the researchers quickly realized that a model’s complexity wasn’t the only driving factor. Some unexpected abilities could be coaxed out of smaller models with fewer parameters — or trained on smaller data sets — if the data was of sufficiently high quality. In addition, how a query was worded influenced the accuracy of the model’s response. When Dyer and his colleagues posed the movie emoji task using a multiple-choice format, for example, the accuracy improvement was less of a sudden jump and more of a gradual increase with more complexity. And last year, in a paper presented at NeurIPS, the field’s flagship meeting, researchers at Google Brain showed how a model prompted to explain itself (a capacity called chain-of-thought reasoning) could correctly solve a math word problem, while the same model without that prompt could not.
Yi Tay, a scientist at Google Brain who worked on the systematic investigation of breakthroughs, points to recent work suggesting that chain-of-thought prompting changes the scaling curves and therefore the point where emergence occurs. In their NeurIPS paper, the Google researchers showed that using chain-of-thought prompts could elicit emergent behaviors not identified in the BIG-bench study. Such prompts, which ask the model to explain its reasoning, may help researchers begin to investigate why emergence occurs at all.
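A chain-of-thought prompt differs from an ordinary one mainly in what it shows the model before asking the question. One common way to write such a prompt is to include a worked example whose answer spells out the intermediate steps. The sketch below assumes that approach. The helper names and the word problems are hypothetical and are not drawn from the NeurIPS paper.

```python
def direct_prompt(problem):
    """Ask for the answer with no reasoning shown."""
    return f"Q: {problem}\nA: The answer is"

def chain_of_thought_prompt(problem, worked_example):
    """Show one solved problem with its reasoning, then ask the new problem."""
    example_problem, example_reasoning, example_answer = worked_example
    return (
        f"Q: {example_problem}\n"
        f"A: {example_reasoning} The answer is {example_answer}.\n"
        f"Q: {problem}\n"
        "A:"  # the model is expected to continue with its own reasoning
    )

# Made-up word problems, for illustration only.
worked_example = (
    "A shelf holds 3 boxes with 4 books each. How many books are on the shelf?",
    "There are 3 boxes and each holds 4 books, so 3 * 4 = 12.",
    "12",
)
problem = "A bus has 5 rows of 6 seats. 7 seats are empty. How many seats are taken?"

plain = direct_prompt(problem)
with_reasoning = chain_of_thought_prompt(problem, worked_example)
print(with_reasoning)
```

The question is identical in both prompts. Only the surrounding text changes, which is why results like these suggest that wording, and not just model size, shapes where an ability appears.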
Recent findings like these suggest at least two possibilities for why emergence occurs, said Ellie Pavlick, a computer scientist at Brown University who studies computational models of language. One is that, as suggested by comparisons to biological systems, larger models truly do gain new abilities spontaneously. “It may very well be that the model has learned something fundamentally new and different that it didn’t have at a smaller size,” she said. “That’s what we’re all hoping is the case, that there’s some fundamental shift that happens when models are scaled up.”
The other, less sensational possibility, she said, is that what appears to be emergent may instead be the culmination of an internal, statistics-driven process that works through chain-of-thought-type reasoning. Larger models may simply be learning heuristics that are out of reach for those with fewer parameters or lower-quality data.
But, she said, finding out which of those explanations is more likely hinges on a better understanding of how LLMs work at all. “Since we don’t know how they work under the hood, we can’t say which of those things is happening.”
Unpredictable Powers and Pitfalls
There is an obvious problem with asking these models to explain themselves: They are notorious liars. “We’re increasingly relying on these models to do basic work,” Ganguli said, “but I do not just trust these. I check their work.” As one of many amusing examples, in February Google introduced its AI chatbot, Bard. The blog post announcing the new tool shows Bard making a factual error.
Emergence leads to unpredictability, and unpredictability — which seems to increase with scaling — makes it difficult for researchers to anticipate the consequences of widespread use.
“It’s hard to know in advance how these models will be used or deployed,” Ganguli said. “And to study emergent phenomena, you have to have a case in mind, and you won’t know until you study the influence of scale what capabilities or limitations might arise.”
In an analysis of LLMs released last June, researchers at Anthropic looked at whether the models would show certain types of racial or social biases, not unlike those previously reported in non-LLM-based algorithms used to predict which former criminals are likely to commit another crime. That study was inspired by an apparent paradox tied directly to emergence: As models improve their performance when scaling up, they may also increase the likelihood of unpredictable phenomena, including those that could potentially lead to bias or harm.
“Certain harmful behaviors kind of come up abruptly in some models,” Ganguli said. He points to a recent analysis of LLMs, known as the BBQ benchmark, which showed that social bias emerges with enormous numbers of parameters. “Larger models abruptly become more biased.” Failure to address that risk, he said, could jeopardize the subjects of these models.
But he offers a counterpoint: When the researchers simply told the model not to rely on stereotypes or social biases — literally by typing in those instructions — the model was less biased in its predictions and responses. This suggests that some emergent properties might also be used to reduce bias. In a paper released in February, the Anthropic team reported on a new “moral self-correction” mode, in which the user prompts the program to be helpful, honest and harmless.
Emergence, Ganguli said, reveals both surprising potential and unpredictable risk. Applications of these large models are already proliferating, so a better understanding of that interplay will help researchers harness the diverse abilities of language models.
“We’re studying how people are actually using these systems,” Ganguli said. But those users are also tinkering, constantly. “We spend a lot of time just chatting with our models,” he said, “and that is actually where you start to get a good intuition about trust — or the lack thereof.”