When you look at a photograph of a cat, chances are that you can recognize the pictured animal whether it’s ginger or striped — or whether the image is black and white, speckled, worn or faded. You can probably also spot the pet when it’s shown curled up behind a pillow or leaping onto a countertop in a blur of motion. You have naturally learned to identify a cat in almost any situation. In contrast, machine vision systems powered by deep neural networks can sometimes even outperform humans at recognizing a cat under fixed conditions, but images that are even a little novel, noisy or grainy can throw off those systems completely.
A research team in Germany has now discovered an unexpected reason why: While humans pay attention to the shapes of pictured objects, deep learning computer vision algorithms routinely latch on to the objects’ textures instead.
This finding, presented at the International Conference on Learning Representations in May, highlights the sharp contrast between how humans and machines “think,” and illustrates how misleading our intuitions can be about what makes artificial intelligences tick. It may also hint at why our own vision evolved the way it did.
Cats With Elephant Skin and Planes Made of Clocks
Deep learning algorithms are trained by presenting a neural network with, say, thousands of images that either contain or do not contain cats. The system finds patterns in that data, which it then uses to decide how best to label an image it has never seen before. The network’s architecture is modeled loosely on that of the human visual system, in that its connected layers let it extract increasingly abstract features from the image. But the system makes the associations that lead it to the right answer through a black-box process that humans can only try to interpret after the fact. “We’ve been trying to figure out what leads to the success of these deep learning computer vision algorithms, and what leads to their brittleness,” said Thomas Dietterich, a computer scientist at Oregon State University who was not involved in the new study.
To do that, some researchers prefer to look at what happens when they trick the network by modifying an image. They have found that very small changes can cause the system to mislabel objects in an image completely — and that large changes can sometimes fail to make the system modify its label at all. Meanwhile, other experts have backtracked through networks to analyze what the individual “neurons” respond to in an image, generating an “activation atlas” of features that the system has learned.
But a group of scientists in the laboratories of the computational neuroscientist Matthias Bethge and the psychophysicist Felix Wichmann at the University of Tübingen in Germany took a more qualitative approach. Last year, the team reported that when they trained a neural network on images degraded by a particular kind of noise, it got better than humans at classifying new images that had been subjected to the same type of distortion. But those images, when altered in a slightly different way, completely duped the network, even though the new distortion looked practically the same as the old one to humans.
To explain that result, the researchers thought about what quality changes the most with even small levels of noise. Texture seemed the obvious choice. “The shape of the object … is more or less intact if you add a lot of noise for a long time,” said Robert Geirhos, a graduate student in Bethge’s and Wichmann’s labs and the lead author of the study. But “the local structure in an image — that gets distorted super fast when you add a bit of noise.” So they came up with a clever way to test how both humans and deep learning systems process images.
Geirhos, Bethge and their colleagues created images that included two conflicting cues, with a shape taken from one object and a texture from another: the silhouette of a cat colored in with the cracked gray texture of elephant skin, for instance, or a bear made up of aluminum cans, or the outline of an airplane filled with overlapping clock faces. Presented with hundreds of these images, humans labeled them based on their shape — cat, bear, airplane — almost every time, as expected. Four different classification algorithms, however, leaned the other way, spitting out labels that reflected the textures of the objects: elephant, can, clock.
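One plausible way to turn the cue-conflict experiment into a single number is sketched below. The article does not give the researchers' scoring formula, so the function name `shape_bias`, the input layout and the choice to ignore answers that match neither cue are all assumptions made for illustration.

```python
from typing import Sequence, Tuple


def shape_bias(predictions: Sequence[str],
               cues: Sequence[Tuple[str, str]]) -> float:
    """Fraction of shape-or-texture decisions that followed the shape cue.

    predictions[i] is the label chosen for image i.
    cues[i] is the (shape_label, texture_label) pair used to build image i.
    Hypothetical helper; not taken from the study's code.
    """
    shape_hits = 0
    texture_hits = 0
    for predicted, (shape_label, texture_label) in zip(predictions, cues):
        if predicted == shape_label:
            shape_hits += 1
        elif predicted == texture_label:
            texture_hits += 1
        # A prediction that matches neither cue is left out of the score.
    decided = shape_hits + texture_hits
    if decided == 0:
        raise ValueError("no prediction matched either cue")
    return shape_hits / decided


# The three example images described in the article.
cues = [("cat", "elephant"), ("bear", "can"), ("airplane", "clock")]
human_like = ["cat", "bear", "airplane"]      # answers follow the silhouette
network_like = ["elephant", "can", "clock"]   # answers follow the texture

print(shape_bias(human_like, cues))    # 1.0
print(shape_bias(network_like, cues))  # 0.0
```

A score near 1 means the labels track shape, as the human participants' labels did; a score near 0 means they track texture, as the four algorithms' labels tended to.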
“This is changing our understanding of how deep feed-forward neural networks — out of the box, or the way they’re usually trained — do visual recognition,” said Nikolaus Kriegeskorte, a computational neuroscientist at Columbia University who did not participate in the study.
Odd as artificial intelligence’s preference for texture over shape may seem at first, it makes sense. “You can think of texture as shape at a fine scale,” Kriegeskorte said. That fine scale is easier for the system to latch on to: The number of pixels with texture information far exceeds the number of pixels that constitute the boundary of an object, and the network’s very first steps involve detecting local features like lines and edges. “That’s what texture is,” said John Tsotsos, a computational vision scientist at York University in Toronto who was also not involved in the new work. “Groupings of line segments that all line up in the same way, for example.”
Geirhos and his colleagues have shown that those local features are sufficient to allow a network to perform image classification tasks. In fact, Bethge and another of the study’s authors, the postdoctoral researcher Wieland Brendel, drove this point home in a paper that was also presented at the conference in May. In that work, they built a deep learning system that operated a lot like classification algorithms before the advent of deep learning — like a “bag of features.” It split up an image into tiny patches, just as current models (like those that Geirhos used in his experiment) initially would, but then, rather than integrating that information gradually to extract higher-level features, it made immediate decisions about the content of each small patch (“this patch contains evidence for a bicycle, that patch contains evidence for a bird”). It simply added those decisions together to determine the identity of the object (“more patches contain evidence for a bicycle, so this is an image of a bicycle”), without any regard for the global spatial relationships between the patches. And yet it could recognize objects with surprising accuracy.
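The patch-voting idea can be sketched in a few lines of NumPy. This is a heavily simplified stand-in, not Brendel and Bethge's model: it assumes a grayscale image, non-overlapping patches and a plain linear scorer (`weights`) on raw pixels, all of which are illustrative choices. It shows the one property the article emphasizes: because per-patch evidence is simply summed, rearranging the patches leaves the decision unchanged.

```python
import numpy as np


def extract_patches(image: np.ndarray, patch: int) -> np.ndarray:
    """Cut an (H, W) image into non-overlapping tiles, one flattened tile per row."""
    h, w = image.shape
    rows, cols = h // patch, w // patch
    trimmed = image[: rows * patch, : cols * patch]
    tiles = trimmed.reshape(rows, patch, cols, patch).swapaxes(1, 2)
    return tiles.reshape(rows * cols, patch * patch)


def class_evidence(image: np.ndarray, weights: np.ndarray, patch: int) -> np.ndarray:
    """Score every patch independently, then add the scores up per class."""
    patches = extract_patches(image, patch)   # (n_patches, patch * patch)
    per_patch = patches @ weights             # (n_patches, n_classes)
    return per_patch.sum(axis=0)              # patch positions play no role


def predict(image: np.ndarray, weights: np.ndarray, patch: int) -> int:
    """Pick the class with the most accumulated patch evidence."""
    return int(np.argmax(class_evidence(image, weights, patch)))


def shuffle_patches(image: np.ndarray, patch: int,
                    rng: np.random.Generator) -> np.ndarray:
    """Return a copy of the image with its tiles placed in random order."""
    h, w = image.shape
    rows, cols = h // patch, w // patch
    tiles = extract_patches(image, patch)
    tiles = tiles[rng.permutation(rows * cols)]
    grid = tiles.reshape(rows, cols, patch, patch).swapaxes(1, 2)
    return grid.reshape(rows * patch, cols * patch)


rng = np.random.default_rng(0)
image = rng.random((8, 8))              # toy 8x8 "image"
weights = rng.normal(size=(16, 3))      # hypothetical scorer: 4x4 patch -> 3 classes
scrambled = shuffle_patches(image, 4, rng)

# Same tiles, different arrangement: the summed evidence is the same.
print(np.allclose(class_evidence(image, weights, 4),
                  class_evidence(scrambled, weights, 4)))
```

A model built this way cannot use an object's overall outline, since the outline exists only in how the patches are arranged.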
“This challenges the assumption that deep learning is doing something completely different” from what previous models did, Brendel said. “Obviously … there’s been a leap. I’m just suggesting the leap is not as far as some people may have hoped for.”
According to Amir Rosenfeld, a postdoctoral researcher at York University and the University of Toronto who did not participate in the study, there are still “large differences between what we think networks should be doing and what they actually do,” including how well they reproduce human behavior.
Brendel expressed a similar view. It’s easy to assume neural networks will solve tasks the way we humans do, he said. “But we tend to forget there are other ways.”
A Nudge Toward More Human Sight
Current deep learning methods can integrate local features like texture into more global patterns like shape. “What is a bit surprising in these papers, and very compellingly demonstrated, is that while the architecture allows for that, it doesn’t automatically happen if you just train it [to classify standard images],” Kriegeskorte said.
Geirhos wanted to see what would happen when the team forced their models to ignore texture. The team took images traditionally used to train classification algorithms and “painted” them in different styles, essentially stripping them of useful texture information. When they retrained each of the deep learning models on the new images, the systems began relying on larger, more global patterns and exhibited a shape bias much more like that of humans.
And when that happened, the algorithms also became better at classifying noisy images, even when they hadn’t been trained to deal with those kinds of distortions. “The shape-based network got more robust for free,” Geirhos said. “This tells us that just having the right kind of bias for specific tasks, in this case a shape bias, helps a lot with generalizing to a novel setting.”
It also hints that humans might naturally have this kind of bias because shape is a more robust way of defining what we see, even in novel or noisy situations. Humans live in a three-dimensional world, where objects are seen from multiple angles under many different conditions, and where our other senses, such as touch, can contribute to object recognition as needed. So it makes sense for our vision to prioritize shape over texture. (Moreover, some psychologists have shown a link between language, learning and humans’ shape bias: When very young children were trained to pay more attention to shape by learning certain categories of words, they were later able to develop a much larger noun or object vocabulary than children who did not receive the training.)
The work serves as a reminder that “data exert more biases and influences than we believe,” Wichmann said. This isn’t the first time researchers have encountered the problem: Facial recognition programs, automated hiring algorithms and other neural networks have previously been shown to give too much weight to unexpected features because of deep-rooted biases in the data they were trained on. Removing those unwanted biases from their decision-making process has proved difficult, but Wichmann said the new work shows it is possible, which he finds encouraging.
Nevertheless, even Geirhos’ models that focused on shape could be defeated by too much noise in an image, or by particular pixel changes — which shows that they are a long way from achieving human-level vision. (In a similar vein, Rosenfeld, Tsotsos and Markus Solbach, a graduate student in Tsotsos’ lab, also recently published research showing that machine learning algorithms cannot perceive similarities between different images as humans can.) Still, with studies like these, “you’re putting your finger on where the important mechanisms of the human brain are not yet captured by these models,” Kriegeskorte said. And “in some cases,” Wichmann said, “perhaps looking at the data set is more important.”
Sanja Fidler, a computer scientist at the University of Toronto who did not participate in the study, agreed. “It’s up to us to design clever data, clever tasks,” she said. She and her colleagues are studying how giving neural networks secondary tasks can help them perform their main function. Inspired by Geirhos’ findings, they recently trained an image classification algorithm not just to recognize the objects themselves, but also to identify which pixels were part of their outline, or shape. The network automatically got better at its regular object identification task. “Given a single task, you get selective attention and become blind to lots of different things,” Fidler said. “If I give you multiple tasks, you might be aware of more things, and that might not happen. It’s the same for these algorithms.” Solving various tasks allows them “to develop biases toward different information,” which is similar to what happened in Geirhos’ experiments on shape and texture.
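A common way to give a classifier a secondary task is to add a second loss term to its training objective. The sketch below assumes that setup; the article does not describe how Fidler's group actually combined the two tasks, and the names `multitask_loss` and `edge_weight` are hypothetical.

```python
import numpy as np


def cross_entropy(class_logits: np.ndarray, label: int) -> float:
    """Softmax cross-entropy for one image (main task: which object?)."""
    shifted = class_logits - class_logits.max()          # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-log_probs[label])


def binary_cross_entropy(edge_logits: np.ndarray, edge_mask: np.ndarray) -> float:
    """Mean per-pixel loss (secondary task: is this pixel on the outline?)."""
    x, y = edge_logits, edge_mask
    # Stable form of -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))].
    per_pixel = np.maximum(x, 0) - x * y + np.log1p(np.exp(-np.abs(x)))
    return float(per_pixel.mean())


def multitask_loss(class_logits: np.ndarray, label: int,
                   edge_logits: np.ndarray, edge_mask: np.ndarray,
                   edge_weight: float = 0.5) -> float:
    """Classification loss plus a weighted outline-prediction loss."""
    return (cross_entropy(class_logits, label)
            + edge_weight * binary_cross_entropy(edge_logits, edge_mask))


# Toy example: 3 classes, a 4x4 outline mask, an untrained model (all-zero outputs).
mask = np.zeros((4, 4))
mask[0, :] = 1.0
loss = multitask_loss(np.zeros(3), 1, np.zeros((4, 4)), mask)
print(loss)
```

Because both terms are minimized together, the features the network learns have to be useful for locating outlines as well as for naming objects, which is one way a model could be nudged toward shape information.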
All this research is “an exciting step in deepening our understanding of what’s going on [in deep learning], perhaps helping us overcome the limitations we’re seeing,” Dietterich said. “That’s why I love this string of papers.”