Nora Kassner suspected her computer wasn’t as smart as people thought. In October 2018, Google released a language model algorithm called BERT, which Kassner, a researcher in the same field, quickly loaded on her laptop. It was Google’s first language model that was self-taught on a massive volume of online data. Like her peers, Kassner was impressed that BERT could complete users’ sentences and answer simple questions. It seemed as if the large language model (LLM) could read text like a human (or better).
But Kassner, at the time a graduate student at Ludwig Maximilian University of Munich, remained skeptical. She felt LLMs should understand what their answers mean — and what they don’t mean. It’s one thing to know that a bird can fly. “A model should automatically also know that the negated statement — ‘a bird cannot fly’ — is false,” she said. But when she and her adviser, Hinrich Schütze, tested BERT and two other LLMs in 2019, they found that the models behaved as if words like “not” were invisible.
Since then, LLMs have skyrocketed in size and ability. “The algorithm itself is still similar to what we had before. But the scale and the performance is really astonishing,” said Ding Zhao, who leads the Safe Artificial Intelligence Lab at Carnegie Mellon University.
But while chatbots have improved their humanlike performances, they still have trouble with negation. They know what it means if a bird can’t fly, but they collapse when confronted with more complicated logic involving words like “not,” which is trivial to a human.
“Large language models work better than any system we have ever had before,” said Pascale Fung, an AI researcher at the Hong Kong University of Science and Technology. “Why do they struggle with something that’s seemingly simple while it’s demonstrating amazing power in other things that we don’t expect it to?” Recent studies have finally started to explain the difficulties, and what programmers can do to get around them. But researchers still don’t understand whether machines will ever truly know the word “no.”
It’s hard to coax a computer into reading and writing like a human. Machines excel at storing lots of data and blasting through complex calculations, so developers build LLMs as neural networks: statistical models that assess how objects (words, in this case) relate to one another. Each linguistic relationship carries some weight, and that weight — fine-tuned during training — codifies the relationship’s strength. For example, “rat” relates more to “rodent” than “pizza,” even if some rats have been known to enjoy a good slice.
In the same way that your smartphone’s keyboard learns that you follow “good” with “morning,” LLMs sequentially predict the next word in a block of text. The bigger the data set used to train them, the better the predictions, and as the amount of data used to train the models has increased enormously, dozens of emergent behaviors have bubbled up. Chatbots have learned style, syntax and tone, for example, all on their own. “An early problem was that they completely could not detect emotional language at all. And now they can,” said Kathleen Carley, a computer scientist at Carnegie Mellon. Carley uses LLMs for “sentiment analysis,” which is all about extracting emotional language from large data sets — an approach used for things like mining social media for opinions.
So new models should get the right answers more reliably. “But we’re not applying reasoning,” Carley said. “We’re just applying a kind of mathematical change.” And, unsurprisingly, experts are finding gaps where these models diverge from how humans read.
Unlike humans, LLMs process language by turning it into math. This helps them excel at generating text — by predicting likely combinations of text — but it comes at a cost.
“The problem is that the task of prediction is not equivalent to the task of understanding,” said Allyson Ettinger, a computational linguist at the University of Chicago. Like Kassner, Ettinger tests how language models fare on tasks that seem easy to humans. In 2019, for example, Ettinger tested BERT with diagnostics pulled from experiments designed to test human language ability. The model’s abilities weren’t consistent. For example:
He caught the pass and scored another touchdown. There was nothing he enjoyed more than a good game of ____. (BERT correctly predicted “football.”)
The snow had piled up on the drive so high that they couldn’t get the car out. When Albert woke up, his father handed him a ____. (BERT incorrectly guessed “note,” “letter,” “gun.”)
And when it came to negation, BERT consistently struggled.
A robin is not a ____. (BERT predicted “robin,” and “bird.”)
On the one hand, it’s a reasonable mistake. “In very many contexts, ‘robin’ and ‘bird’ are going to be predictive of one another because they’re probably going to co-occur very frequently,” Ettinger said. On the other hand, any human can see it’s wrong.
By 2023, OpenAI’s ChatGPT and Google’s bot, Bard, had improved enough to predict that Albert’s father had handed him a shovel instead of a gun. Again, this was likely the result of increased and improved data, which allowed for better mathematical predictions.
But the concept of negation still tripped up the chatbots. Consider the prompt, “What animals don’t have paws or lay eggs, but have wings?” Bard replied, “No animals.” ChatGPT correctly replied bats, but also included flying squirrels and flying lemurs, which do not have wings. In general, “negation [failures] tended to be fairly consistent as models got larger,” Ettinger said. “General world knowledge doesn’t help.”
The obvious question becomes: Why don’t the phrases “do not” or “is not” simply prompt the machine to ignore the best predictions from “do” and “is”?
That failure is not an accident. Negations like “not,” “never” and “none” are known as stop words, which are functional rather than descriptive. Compare them to words like “bird” and “rat” that have clear meanings. Stop words, in contrast, don’t add content on their own. Other examples include “a,” “the” and “with.”
“Some models filter out stop words to increase the efficiency,” said Izunna Okpala, a doctoral candidate at the University of Cincinnati who works on perception analysis. Nixing every “a” and so on makes it easier to analyze a text’s descriptive content. You don’t lose meaning by dropping every “the.” But processes used in some older specialist models sweep out negations as well, effectively ignoring them.
So why can’t LLMs just learn what stop words mean? Ultimately, because “meaning” is something orthogonal to how these models work. Negations matter to us because we’re equipped to grasp what those words do. But models learn “meaning” from mathematical weights: “Rose” appears often with “flower,” “red” with “smell.” And it’s impossible to learn what “not” is this way.
Kassner says the training data is also to blame, and more of it won’t necessarily solve the problem. Models mainly train on affirmative sentences because that’s how people communicate most effectively. “If I say I’m born on a certain date, that automatically excludes all the other dates,” Kassner said. “I wouldn’t say ‘I’m not born on that date.’”
This dearth of negative statements undermines a model’s training. “It’s harder for models to generate factually correct negated sentences, because the models haven’t seen that many,” Kassner said.
Untangling the Not
If more training data isn’t the solution, what might work? Clues come from an analysis posted to arxiv.org in March, where Myeongjun Jang and Thomas Lukasiewicz, computer scientists at the University of Oxford (Lukasiewicz is also at the Vienna University of Technology), tested ChatGPT’s negation skills. They found that ChatGPT was better at negation than earlier LLMs, even though the way LLMs learned remained unchanged. “It is quite a surprising result,” Jang said. He believes the secret weapon was human feedback.
The ChatGPT algorithm had been fine-tuned with “human-in-the-loop” learning, where people validate responses and suggest improvements. So when users noticed ChatGPT floundering with simple negation, they reported that poor performance, allowing the algorithm to eventually get it right.
John Schulman, a developer of ChatGPT, described in a recent lecture how human feedback was also key to another improvement: getting ChatGPT to respond “I don’t know” when confused by a prompt, such as one involving negation. “Being able to abstain from answering is very important,” Kassner said. Sometimes “I don’t know” is the answer.
Yet even this approach leaves gaps. When Kassner prompted ChatGPT with “Alice is not born in Germany. Is Alice born in Hamburg?” the bot still replied that it didn’t know. She also noticed it fumbling with double negatives like “Alice does not know that she does not know the painter of the Mona Lisa.”
“It’s not a problem that is naturally solved by the way that learning works in language models,” Lukasiewicz said. “So the important thing is to find ways to solve that.”
One option is to add an extra layer of language processing to negation. Okpala developed one such algorithm for sentiment analysis. His team’s paper, posted on arxiv.org in February, describes applying a library called WordHoard to catch and capture negation words like “not” and antonyms in general. It’s a simple algorithm that researchers can plug into their own tools and language models. “It proves to have higher accuracy compared to just doing sentiment analysis alone,” Okpala said. When he combined his code and WordHoard with three common sentiment analyzers, they all improved in accuracy in extracting opinions — the best one by 35%.
Another option is to modify the training data. When working with BERT, Kassner used texts with an equal number of affirmative and negated statements. The approach helped boost performance in simple cases where antonyms (“bad”) could replace negations (“not good”). But this is not a perfect fix, since “not good” doesn’t always mean “bad.” The space of “what’s not” is simply too big for machines to sift through. “It’s not interpretable,” Fung said. “You’re not me. You’re not shoes. You’re not an infinite amount of things.”
Finally, since LLMs have surprised us with their abilities before, it’s possible even larger models with even more training will eventually learn to handle negation on their own. Jang and Lukasiewicz are hopeful that diverse training data, beyond just words, will help. “Language is not only described by text alone,” Lukasiewicz said. “Language describes anything. Vision, audio.” OpenAI’s new GPT-4 integrates text, audio and visuals, making it reportedly the largest “multimodal” LLM to date.
Future Not Clear
But while these techniques, together with greater processing and data, might lead to chatbots that can master negation, most researchers remain skeptical. “We can’t actually guarantee that that will happen,” Ettinger said. She suspects it’ll require a fundamental shift, moving language models away from their current objective of predicting words.
After all, when children learn language, they’re not attempting to predict words, they’re just mapping words to concepts. They’re “making judgments like ‘is this true’ or ‘is this not true’ about the world,” Ettinger said.
If an LLM could separate true from false this way, it would open the possibilities dramatically. “The negation problem might go away when the LLM models have a closer resemblance to humans,” Okpala said.
Of course, this might just be switching one problem for another. “We need better theories of how humans recognize meaning and how people interpret texts,” Carley said. “There’s just a lot less money put into understanding how people think than there is to making better algorithms.”
And dissecting how LLMs fail is getting harder, too. State-of-the-art models aren’t as transparent as they used to be, so researchers evaluate them based on inputs and outputs, rather than on what happens in the middle. “It’s just proxy,” Fung said. “It’s not a theoretical proof.” So what progress we have seen isn’t even well understood.
And Kassner suspects that the rate of improvement will slow in the future. “I would have never imagined the breakthroughs and the gains we’ve seen in such a short amount of time,” she said. “I was always quite skeptical whether just scaling models and putting more and more data in it is enough. And I would still argue it’s not.”