The danger of having artificially intelligent machines do our bidding is that we might not be careful enough about what we wish for. The lines of code that animate these machines will inevitably lack nuance, forget to spell out caveats, and end up giving AI systems goals and incentives that don’t align with our true preferences.

A now-classic thought experiment illustrating this problem was posed by the Oxford philosopher Nick Bostrom in 2003. Bostrom imagined a superintelligent robot, programmed with the seemingly innocuous goal of manufacturing paper clips. The robot eventually turns the whole world into a giant paper clip factory.

Such a scenario can be dismissed as academic, a worry that might arise in some far-off future. But misaligned AI has become an issue far sooner than expected.

The most alarming example is one that affects billions of people. YouTube, aiming to maximize viewing time, deploys AI-based content recommendation algorithms. Two years ago, computer scientists and users began noticing that YouTube’s algorithm seemed to achieve its goal by recommending increasingly extreme and conspiratorial content. One researcher reported that after she viewed footage of Donald Trump campaign rallies, YouTube next offered her videos featuring “white supremacist rants, Holocaust denials and other disturbing content.” The algorithm’s upping-the-ante approach went beyond politics, she said: “Videos about vegetarianism led to videos about veganism. Videos about jogging led to videos about running ultramarathons.” As a result, research suggests, YouTube’s algorithm has been helping to polarize and radicalize people and spread misinformation, just to keep us watching. “If I were planning things out, I probably would not have made that the first test case of how we’re going to roll out this technology at a massive scale,” said Dylan Hadfield-Menell, an AI researcher at the University of California, Berkeley.

YouTube’s engineers probably didn’t intend to radicalize humanity. But coders can’t possibly think of everything. “The current way we do AI puts a lot of burden on the designers to understand what the consequences of the incentives they give their systems are,” said Hadfield-Menell. “And one of the things we’re learning is that a lot of engineers have made mistakes.”

A major aspect of the problem is that humans often don’t know what goals to give our AI systems, because we don’t know what we really want. “If you ask anyone on the street, ‘What do you want your autonomous car to do?’ they would say, ‘Collision avoidance,’” said Dorsa Sadigh, an AI scientist at Stanford University who specializes in human-robot interaction. “But you realize that’s not just it; there are a bunch of preferences that people have.” Super-safe self-driving cars go too slowly and brake so often that they make passengers sick. When programmers try to list all the goals and preferences that a robotic car should simultaneously juggle, the list inevitably ends up incomplete. Sadigh said that when driving in San Francisco, she has often gotten stuck behind a self-driving car that’s stalled in the street. It’s safely avoiding contact with a moving object, the way its programmers told it to, but the object is something like a plastic bag blowing in the wind.
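
The stalled-car story can be made concrete with a toy sketch. Everything in it is an assumption made for illustration: the two actions, the numbers attached to them, and the weights in the cost functions are invented, and nothing here reflects how any real self-driving system is built. It shows only how a planner that scores actions on collision risk alone will wait for a plastic bag, while one that also weighs delay, passenger comfort and how much harm a collision would actually cause will not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    collision_risk: float  # chance of touching the detected object (0 to 1)
    delay_seconds: float   # time the passengers lose
    harsh_braking: float   # passenger discomfort from braking (0 to 1)


# Hypothetical options when something drifts across the lane.
ACTIONS = [
    Action("stop_and_wait", collision_risk=0.0, delay_seconds=60.0, harsh_braking=0.8),
    Action("slow_and_proceed", collision_risk=0.9, delay_seconds=3.0, harsh_braking=0.1),
]


def narrow_cost(action):
    """The goal as literally stated: avoid contact with any moving object."""
    return action.collision_risk


def broader_cost(action, object_harm):
    """Adds preferences the narrow goal left out (weights are made up)."""
    return (
        action.collision_risk * object_harm  # contact matters only if it causes harm
        + 0.01 * action.delay_seconds        # passengers dislike waiting
        + 0.5 * action.harsh_braking         # and dislike being thrown forward
    )


def choose(cost):
    """Pick the action with the lowest cost under the given objective."""
    return min(ACTIONS, key=cost)


# Plastic bag: hitting it causes no harm (object_harm = 0).
narrow_choice = choose(narrow_cost).name
bag_choice = choose(lambda a: broader_cost(a, object_harm=0.0)).name
# Pedestrian: hitting them is catastrophic (object_harm = 100).
pedestrian_choice = choose(lambda a: broader_cost(a, object_harm=100.0)).name

print(narrow_choice, bag_choice, pedestrian_choice)
```

With these invented numbers, the narrow objective stops for the bag, while the broader one drives on past the bag and still stops for the pedestrian. The broader cost is itself only a longer list, and it would fail in some situation its author did not anticipate, which is the difficulty Sadigh describes.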

To avoid these pitfalls and potentially solve the AI alignment problem, researchers have begun to develop an entirely new method of programming beneficial machines. The approach is most closely associated with the ideas and research of Stuart Russell, a decorated computer scientist at Berkeley. Russell, 57, did pioneering work on rationality, decision-making and machine learning in the 1980s and ’90s and is the lead author of the widely used textbook Artificial Intelligence: A Modern Approach. In the past five years, he has become an influential voice on the alignment problem and a ubiquitous figure — a well-spoken, reserved British one in a black suit — at international meetings and panels on the risks and long-term governance of AI.