When engineers first endeavored to teach computers to see, they took it for granted that computers would see like humans. The first proposals for computer vision in the 1960s were “clearly motivated by characteristics of human vision,” said John Tsotsos, a computer scientist at York University.
Things have changed a lot since then.
Computer vision has grown from a pie-in-the-sky idea into a sprawling field. Computers can now outperform human beings in some vision tasks, like classifying pictures — dog or wolf? — and detecting anomalies in medical images. And the way artificial “neural networks” process visual data looks increasingly dissimilar from the way humans do.
Computers are beating us at our own game by playing by different rules.
The neural networks underlying computer vision are fairly straightforward. They receive an image as input and process it through a series of steps. They first detect pixels, then edges and contours, then whole objects, before eventually producing a final guess about what they’re looking at. These are known as “feed forward” systems because of their assembly-line setup.
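To make the assembly-line idea concrete, here is a minimal toy sketch in Python. It is not a real neural network: the stage functions, the threshold and the one-dimensional "image" are all assumptions invented for illustration. What it shows is only the defining property of a feed-forward system, that data moves through the stages once, in order, and no later stage ever sends information back to an earlier one.

```python
from typing import Callable, List, Sequence

# All names below are hypothetical; this is a toy, not a trained network.

def detect_edges(pixels: Sequence[float]) -> List[float]:
    """Stage 1: edge strength as the absolute difference between neighbors."""
    return [abs(b - a) for a, b in zip(pixels, pixels[1:])]

def find_contours(edges: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Stage 2: keep the positions where the edge strength exceeds a threshold."""
    return [i for i, strength in enumerate(edges) if strength > threshold]

def classify(contours: Sequence[int]) -> str:
    """Stage 3: a final guess based only on how many contours were found."""
    return "object" if len(contours) >= 2 else "background"

def feed_forward(pixels: Sequence[float], stages: Sequence[Callable]) -> str:
    """Pass the data through each stage exactly once, front to back."""
    data = pixels
    for stage in stages:
        data = stage(data)  # the output of one stage is the input of the next
    return data

STAGES = [detect_edges, find_contours, classify]

if __name__ == "__main__":
    print(feed_forward([0, 0, 1, 1, 0, 0], STAGES))
```

In a real system each stage would be a layer with learned weights rather than a hand-written rule, but the one-way flow of data is the same.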
There is a lot we don’t know about human vision, but we know it doesn’t work like that. In our recent story, “A Mathematical Model Unlocks the Secrets of Vision,” Quanta described a new mathematical model that tries to explain the central mystery of human vision: how the visual cortex in the brain creates vivid, accurate representations of the world based on the scant information it receives from the retina.
The model suggests that the visual cortex achieves this feat through a series of neural feedback loops that refine small changes in data from the outside world into the diverse range of images that appear before our mind’s eye. This feedback process is very different from the feed-forward methods that enable computer vision.
“This work really shows how sophisticated and in some sense different the visual cortex is” from computer vision, said Jonathan Victor, a neuroscientist at Cornell University.
But computer vision is superior to human vision at some tasks. This raises the question: Does computer vision need inspiration from human vision at all?
In some ways, the answer is obviously no. The information that reaches the visual cortex is constrained by anatomy: Relatively few nerves connect the visual cortex with the outside world, which limits the amount of visual data the cortex has to work with. Computers don’t have the same bandwidth concerns, so there’s no reason they need to work with sparse information.
“If I had infinite computing power and infinite memory, do I need to sparsify anything? The answer is likely no,” Tsotsos said.
But Tsotsos thinks it’s folly to disregard human vision.
The classification tasks computers are good at today are the “low-hanging fruit” of computer vision, he said. To master these tasks, computers merely need to find correlations in massive data sets. For higher-order tasks, like scanning an object from multiple angles in order to determine what it is (think about the way you familiarize yourself with a statue by walking around it), such correlations may not be enough to go on. Computers may need to take a cue from humans to get it right.
(In an interview with Quanta Magazine last year, the artificial intelligence pioneer Judea Pearl made this point more generally when he argued that correlation training won’t get AI systems very far in the long run.)
For example, a key feature of human vision is the ability to do a double take. We process visual information and reach a conclusion about what we’ve seen. When that conclusion is jarring, we look again, and often the second glance tells us what’s really going on. Computer vision systems working in a feed-forward manner typically lack this ability, which leads them to fail spectacularly at even some simple vision tasks.
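A double take can be sketched as a simple loop. The Python below is a hypothetical illustration, not a description of how the brain or any existing vision system works. It assumes that a "jarring" conclusion can be stood in for by a low confidence score, and that each further glance (the made-up `toy_glance` function) can return a revised guess.

```python
from typing import Callable, Tuple

Guess = Tuple[str, float]  # (label, confidence between 0 and 1)

def look_with_double_take(
    glance: Callable[[int], Guess],
    min_confidence: float = 0.8,
    max_glances: int = 3,
) -> Tuple[str, int]:
    """Look again while the current guess is unconvincing.

    Returns the final label and the number of glances it took.
    """
    label, confidence = glance(1)
    glances_used = 1
    # Assumption: "jarring" is modeled as confidence below a threshold.
    while confidence < min_confidence and glances_used < max_glances:
        glances_used += 1
        label, confidence = glance(glances_used)
    return label, glances_used

def toy_glance(n: int) -> Guess:
    """Hypothetical viewer: unsure at first sight, confident on a second look."""
    return ("wolf", 0.55) if n == 1 else ("dog", 0.95)

if __name__ == "__main__":
    print(look_with_double_take(toy_glance))
```

A strictly feed-forward system corresponds to setting `max_glances` to 1: whatever the first pass concludes is final, however shaky that conclusion is.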
There’s another aspect of human vision, subtler and more important, that computer vision lacks.
It takes years for the human visual system to mature. A 2019 paper by Tsotsos and his collaborators found that people don’t fully acquire the ability to suppress clutter in a crowded scene and focus on what they’re looking for until around age 17. Other research has found that the ability to perceive faces keeps improving until around age 20.
Computer vision systems work by digesting massive amounts of data. Their underlying architecture is fixed and doesn’t mature over time, the way the developing brain does. If the underlying learning mechanisms are so different, will the results be, too? Tsotsos thinks computer vision systems are in for a reckoning.
“Learning in these deep learning methods is as unrelated to human learning as can be,” he said. “That tells me the wall is coming. You’ll reach a point where these systems can no longer move forward in terms of their development.”