Imagine that your neighbor calls to ask a favor: Could you please feed their pet rabbit some carrot slices? Easy enough, you’d think. You can imagine their kitchen, even if you’ve never been there — carrots in a fridge, a drawer holding various knives. It’s abstract knowledge: You don’t know what your neighbor’s carrots and knives look like exactly, but you won’t take a spoon to a cucumber.
Artificial intelligence programs can’t compete. What seems to you like an easy task is a huge undertaking for current algorithms.
An AI-trained robot can find a specified knife and carrot hiding in a familiar kitchen, but in a different kitchen it will lack the abstract skills to succeed. “They don’t generalize to new environments,” said Victor Zhong, a graduate student in computer science at the University of Washington. The machine fails because there’s simply too much to learn, and too vast a space to explore.
The problem is that these robots — and AI agents in general — don’t have a foundation of concepts to build on. They don’t know what a knife or a carrot really is, much less how to open a drawer, choose one and cut slices. This limitation is due in part to the fact that many advanced AI systems get trained with a method called reinforcement learning that’s essentially self-education through trial and error. AI agents trained with reinforcement learning can execute the job they were trained to do very well, in the environment they were trained to do it in. But change the job or the environment, and these systems will often fail.
To get around this limitation, computer scientists have begun to teach machines important concepts before setting them loose. It’s like reading a manual before using new software: You could try to explore without it, but you’ll learn far faster with it. “Humans learn through a combination of both doing and reading,” said Karthik Narasimhan, a computer scientist at Princeton University. “We want machines to do the same.”
New work from Zhong and others shows that priming a learning model in this way can supercharge learning, both in simulated environments online and in the real world with robots. And it doesn’t just make algorithms learn faster; it guides them toward skills they’d otherwise never learn. Researchers want these agents to become generalists, capable of learning anything from chess to shopping to cleaning. And as demonstrations become more practical, scientists think this approach might even change how humans can interact with robots.
“It’s been a pretty big breakthrough,” said Brian Ichter, a research scientist in robotics at Google. “It’s pretty unimaginable how far it’s come in a year and a half.”
At first glance, machine learning has already been remarkably successful. Many models use reinforcement learning, where algorithms learn by getting rewards. They begin totally ignorant, but trial and error eventually becomes trial and triumph. Reinforcement learning agents can easily master simple games.
Consider the video game Snake, where players control a snake that grows longer as it eats digital apples. You want your snake to eat the most apples, stay within the boundaries and avoid running into its increasingly bulky body. Such clear right and wrong outcomes give a well-rewarded machine agent positive feedback, so enough attempts can take it from “noob” to High Score.
But suppose the rules change. Perhaps the same agent must play on a larger grid and in three dimensions. While a human player could adapt quickly, the machine can’t, because of two critical weaknesses. First, the larger space means it takes longer for the snake to stumble upon apples, and learning slows exponentially when rewards become sparse. Second, the new dimension provides a totally new experience, and reinforcement learning struggles to generalize to new challenges.
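The first weakness can be illustrated with a toy experiment. The sketch below is not taken from the research described here. It assumes a hypothetical agent that moves at random on a grid, with a single apple fixed in the far corner, and counts how many steps pass before the first reward. The function names, grid sizes and random seed are all illustrative choices.

```python
import random


def steps_to_first_reward(side, dims, rng, max_steps=100_000):
    """Random-walk on a side**dims grid until the agent lands on the apple."""
    agent = [0] * dims            # start in one corner
    apple = [side - 1] * dims     # hypothetical apple in the opposite corner
    for step in range(1, max_steps + 1):
        axis = rng.randrange(dims)        # pick a direction at random
        delta = rng.choice((-1, 1))
        # Moves that would leave the grid are clamped, so the agent stays put.
        agent[axis] = min(side - 1, max(0, agent[axis] + delta))
        if agent == apple:
            return step
    return max_steps                      # give up: no reward was ever seen


def mean_steps(side, dims, trials=200, seed=0):
    """Average number of steps before the first reward over many episodes."""
    rng = random.Random(seed)
    total = sum(steps_to_first_reward(side, dims, rng) for _ in range(trials))
    return total / trials


small_2d = mean_steps(side=5, dims=2)   # the familiar flat board
large_3d = mean_steps(side=8, dims=3)   # a bigger board with a third dimension
print(f"2D 5x5:   {small_2d:.0f} steps on average before the first apple")
print(f"3D 8x8x8: {large_3d:.0f} steps on average before the first apple")
```

The average for the larger three-dimensional grid should come out well above the one for the small flat grid. Until that first apple turns up, an agent that learns only from rewards has nothing to learn from.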
Zhong says we don’t need to accept these obstacles. “Why is it that when we want to play chess” — another game that reinforcement learning has mastered — “we train a reinforcement learning agent from scratch?” Such approaches are inefficient. The agent wanders around aimlessly until it stumbles upon a good situation, such as a checkmate, and Zhong says it requires careful human design to get the agent to know what it means for a situation to be good. “Why do we have to do this when we already have so many books on how to play chess?”
Partly, it’s because machines have struggled to understand human language and decipher images in the first place. For a robot to complete vision-based tasks like finding and slicing carrots, for example, it must know what a carrot is — the image of a thing must be “grounded” in a more fundamental understanding of what that thing is. Until recently, there was no good way of doing that, but a boom in the speed and scale of language and image processing has made the new successes possible.
New natural language processing models allow machines to essentially learn the meaning behind words and sentences — to ground them in things in the world — rather than just store a simple (and limited) meaning like a digital dictionary.
Computer vision has seen a similar digital explosion. Around 2009, ImageNet debuted as a database of annotated images for computer vision research. Today it hosts over 14 million images of objects and places. And programs like OpenAI’s DALL·E generate new images upon command that look human-made, despite having no exact comparison to draw from.
It shows how machines only now have access to enough online data to really learn about the world, according to Anima Anandkumar, a computer scientist at the California Institute of Technology and Nvidia. And it’s a sign that they can learn from concepts as we do and use them for generation. “We are in such a great moment now,” she said. “Because once we can get generation, there is so much more we can do.”
Gaming the System
Researchers like Zhong decided machines didn’t have to embark on their explorations wholly uninformed anymore. Armed with sophisticated language models, the researchers could add a pre-training step where a program learned from online information before its trials and errors.
To test the idea, he and his colleagues compared the pre-training to traditional reinforcement learning in five different game-like settings where machine agents interpreted language commands to solve problems. Each simulated environment challenged the machine agent uniquely. One asked the agent to manipulate items in a 3D kitchen; another required reading text to learn a precise sequence of actions to fight monsters. But the most complicated setting was a real game, the 35-year-old NetHack, where the goal is to navigate a sophisticated dungeon to retrieve an amulet.
For the simple settings, automated pre-training meant simply grounding the important concepts: This is a carrot, that is a monster. For NetHack, the agent trained by watching humans play, using playthroughs uploaded to the internet by human players. These playthroughs didn’t even have to be that good, because the agent wasn’t meant to become an expert, just a regular player. It only needed to build intuition by watching: What would a human do in a given scenario? The agent would then decide which moves were successful, formulating its own carrot and stick.
“Through pre-training, we form good priors for how to associate language descriptions with things that are happening in the world,” Zhong said. The agent would play better from the start and learn more quickly during subsequent reinforcement learning.
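One simple way to picture a “prior” learned from watching people is a table of how often human players chose each move in each situation. The sketch below is an assumption made for illustration, not the method Zhong’s team used. The situations, the actions and the fit_prior function are all invented.

```python
from collections import Counter, defaultdict


def fit_prior(demonstrations):
    """Turn (situation, action) pairs from human play into action probabilities."""
    counts = defaultdict(Counter)
    for situation, action in demonstrations:
        counts[situation][action] += 1
    return {
        situation: {a: n / sum(c.values()) for a, n in c.items()}
        for situation, c in counts.items()
    }


# Hypothetical snippets of recorded human play (not real NetHack data).
demos = [
    ("monster adjacent", "attack"),
    ("monster adjacent", "attack"),
    ("monster adjacent", "flee"),
    ("door closed", "open door"),
]

prior = fit_prior(demos)
for situation, actions in prior.items():
    print(situation, actions)
```

An agent that starts reinforcement learning from a table like this would already lean toward moves people tend to make, rather than picking among every possible move with equal odds.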
As a result, the pre-trained agent did outperform the traditionally trained one. “We get gains across the board in all five of these environments,” Zhong said. Simpler settings only showed a slight edge, but in NetHack’s complicated dungeons, the agent learned many times faster and reached a skill level that the classic approach couldn’t. “You might be getting a 10x performance because if you don’t do this, then you just don’t learn a good policy,” he said.
“These generalist agents are a big leap from what standard reinforcement learning does,” Anandkumar said.
Her team also pre-trains agents to get them to learn more quickly, achieving significant progress on the world’s bestselling video game, Minecraft. It’s known as a “sandbox” game, meaning it gives players a virtually infinite space in which to interact and create new worlds. It’s futile to program a reward function for thousands of tasks individually, so instead the team’s model (“MineDojo”) built its understanding of the game by watching captioned playthrough videos. No need to codify good behavior.
“We are getting automated reward functions,” Anandkumar said. “This is the first benchmark with thousands of tasks and the ability to do reinforcement learning with open-ended tasks specified through text prompts.”
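The idea of a reward that comes from a text prompt, rather than from hand-written rules, can be sketched in a few lines. This toy version scores how closely a description of what the agent is doing matches the task prompt. Simple word counts stand in here for a model trained on captioned videos, so this is an assumption-laden illustration and not MineDojo’s actual code. The function names and example sentences are invented.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a learned encoder: a bag of lowercase words."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def text_reward(task_prompt, observation_caption):
    """Reward is higher when the observed behavior resembles the prompt."""
    return cosine(embed(task_prompt), embed(observation_caption))


prompt = "shear a sheep"
on_task = text_reward(prompt, "the player starts to shear a sheep")
off_task = text_reward(prompt, "the player digs a hole")
print(on_task, off_task)
```

Because the same scoring function works for any prompt, nobody has to write a separate reward for each of thousands of tasks.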
Games were a great way to show that pre-training models could work, but they’re still simplified worlds. Training robots to handle the real world, where the possibilities are practically endless, is much harder. “We asked the question: Is there something in between?” Narasimhan said. So he decided to do some online shopping.
His team created WebShop. “It’s basically like a shopping butler,” Narasimhan said. Users can say something like “Give me a Nike shoe that’s white and under $100, and I want the reviews to state that they’re very comfortable for toddlers,” and the program finds and buys the shoe.
As with Zhong’s and Anandkumar’s games, WebShop developed an intuition by training with images and text, this time from Amazon pages. “Over time, it learns to understand the language and map it to actions it has to take on the website.”
At first glance, a shopping butler may not seem that futuristic. But while a cutting-edge chatbot can link you to a desired sneaker, interactions like placing the order require a wholly different skill set. And even though your bedside Alexa or Google Home speakers can place orders, they rely on proprietary software that carries out preordained tasks. WebShop navigates the web the way people do: by reading, typing and clicking.
“It’s a step closer toward general intelligence,” Narasimhan said.
Of course, getting robots to interact with the real world has its own challenges. Consider a bottle, for example. You can recognize one by its appearance, you know it’s meant to store liquids, and you understand how to manipulate it with your hands. Can real machines ever turn words and images into a complex intelligence of motion?
Narasimhan collaborated with Anirudha Majumdar, a roboticist at Princeton, to find out. They taught a robotic arm to manipulate tools it had never seen before, and pre-trained it using descriptive language taken from successful language models. The program learned faster and performed better with almost every tool and action, compared to programs learning by traditional exploration, according to results posted to the preprint server arxiv.org last June.
Engineers have built a library of even more complex commands at Google’s robotics labs, also rooted in context-building pre-training. “The world of possibilities that you have to consider is huge,” said Karol Hausman, a research scientist on the Google robotics team. “So we ask the language model to break it down for us.”
The team worked with a mobile helper robot, with a seven-jointed arm, which they trained using language skills. For any given command — like “help me clean my spilled drink” — the program uses a language model to suggest actions from a library of 700 trained motions, such as “grab” a paper towel, “pick up” the can, or “throw away” the can. And Hausman says it acknowledges its limitations with phrases such as “I’m actually not capable of wiping it down. But I can bring you a sponge.” The team recently reported results from this project, called SayCan.
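A rough sketch of how such a selection step might work: each skill in the library gets one score for how relevant the language model thinks it is to the request, and a second score for how likely the robot is to pull it off right now. The skill with the best combined score wins. The skills and numbers below are invented for illustration, and the choose_skill function is a hypothetical simplification, not Google’s code.

```python
def choose_skill(language_scores, feasibility_scores):
    """Pick the skill with the highest relevance-times-feasibility score."""
    combined = {
        skill: language_scores[skill] * feasibility_scores.get(skill, 0.0)
        for skill in language_scores
    }
    best = max(combined, key=combined.get)
    return best, combined


# Request: "help me clean my spilled drink" (all scores are made up).
language_scores = {       # how useful the language model rates each skill
    "wipe the table": 0.50,
    "find a sponge": 0.30,
    "pick up the can": 0.15,
    "go to the table": 0.05,
}
feasibility_scores = {    # how likely the robot is to succeed at each skill
    "wipe the table": 0.0,   # this robot cannot wipe
    "find a sponge": 0.8,
    "pick up the can": 0.9,
    "go to the table": 1.0,
}

best, combined = choose_skill(language_scores, feasibility_scores)
print(best)
```

In this made-up case the language model likes “wipe the table” best, but the robot cannot do it, so the combined score steers it toward fetching a sponge instead.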
Another perk of empowering robots with language models is that translating synonyms and words in other languages becomes trivial. One person can say “twist,” while another says “rotate,” and the robot understands both. “The craziest thing that we have tried is that it also understands emojis,” said Fei Xia, a research scientist at Google.
The Bots Are Learning
SayCan is perhaps the most advanced demonstration of language-grounded learning in robotics to date. And language and image models are constantly improving, creating better and more complex pre-training techniques.
But Xia is careful to temper the excitement. “Someone half-jokingly said we reached the ‘robot GPT’ moment,” he said, referring to the groundbreaking language models that understand a wide array of human commands. “We’re not there yet, and there is much more to be explored.”
For instance, these models can provide incorrect answers or take errant actions, which researchers are trying to understand. Robots also haven’t yet mastered “embodiment”: Whereas humans have a physical intuition built on childhoods spent playing with toys, robots still require real-world interactions to develop this type of intuition. “For some settings, there are a lot of unlabeled demonstrations,” Zhong said — think of databases of video game interactions like Minecraft and NetHack. No database can quickly teach robots intelligent motion.
Still, progress is happening fast. And more researchers believe that smarter robotics will be the end result. Narasimhan traces this human-robot evolution from punch cards to the next technology. “We had keyboards and mice and then touch screens,” he said. Grounded language is next. You’ll speak to your computer for answers and errands. “This whole dream of assistants being really capable has not happened yet,” he said. “But I think it will happen very soon.”