Computer Science

A Common Logic to Seeing Cats and Cosmos

There may be a universal logic to how physicists, computers and brains tease out important features from among other irrelevant bits of data.


When in 2012 a computer learned to recognize cats in YouTube videos and just last month another correctly captioned a photo of “a group of young people playing a game of Frisbee,” artificial intelligence researchers hailed yet more triumphs in “deep learning,” the wildly successful set of algorithms loosely modeled on the way brains grow sensitive to features of the real world simply through exposure.

Using the latest deep-learning protocols, computer models consisting of networks of artificial neurons are becoming increasingly adept at image, speech and pattern recognition — core technologies in robotic personal assistants, complex data analysis and self-driving cars. But for all their progress training computers to pick out salient features from other, irrelevant bits of data, researchers have never fully understood why these algorithms, or biological learning itself, work.

Now, two physicists have shown that one form of deep learning works exactly like one of the most important and ubiquitous mathematical techniques in physics, a procedure for calculating the large-scale behavior of physical systems such as elementary particles, fluids and the cosmos.

The new work, completed by Pankaj Mehta of Boston University and David Schwab of Northwestern University, demonstrates that a statistical technique called “renormalization,” which allows physicists to accurately describe systems without knowing the exact state of all their component parts, also enables artificial neural networks to categorize data as, say, “a cat,” regardless of the cat’s color, size or posture in a given video.

Courtesy of Pankaj Mehta

Pankaj Mehta, an assistant professor of physics at Boston University.

“They actually wrote down on paper, with exact proofs, something that people only dreamed existed,” said Ilya Nemenman, a biophysicist at Emory University. “Extracting relevant features in the context of statistical physics and extracting relevant features in the context of deep learning are not just similar words, they are one and the same.”

As for our own remarkable knack for spotting a cat in the bushes, a familiar face in a crowd or indeed any object amid the swirl of color, texture and sound that surrounds us, strong similarities between deep learning and biological learning suggest that the brain may also employ a form of renormalization to make sense of the world.

“Maybe there is some universal logic to how you can pick out relevant features from data,” said Mehta. “I would say this is a hint that maybe something like that exists.”

The finding formalizes what Schwab, Mehta and others saw as a philosophical similarity between physicists’ techniques and the learning procedure behind object or speech recognition. Renormalization is “taking a really complicated system and distilling it down to the fundamental parts,” Schwab said. “And that’s what deep neural networks are trying to do as well. And what brains are trying to do.”

Learning in Layers

A decade ago, deep learning didn’t seem to work. Computer models running the procedure often failed to recognize objects in photos or spoken words in audio recordings.

Geoffrey Hinton, a British computer scientist at the University of Toronto, and other researchers had devised the procedure to run on multilayered webs of virtual neurons that transmit signals to their neighbors by “firing” on and off. The design of these “deep” neural networks was inspired by the layered architecture of the human visual cortex — the part of the brain that transforms a flood of photons into meaningful perceptions.

When a person looks at a cat walking across a lawn, the visual cortex appears to process the scene hierarchically, with neurons in each successive layer firing in response to larger-scale, more pronounced features. At first, neurons in the retina might fire if they detect contrasts in their patch of the visual field, indicating an edge or endpoint. These signals travel to higher-layer neurons, which are sensitive to combinations of edges and other increasingly complex parts. Moving up the layers, a whisker signal might pair with another whisker signal, and those might join forces with pointy ears, ultimately triggering a top-layer neuron that corresponds to the concept of a cat.

A decade ago, Hinton was trying to replicate the process by which a developing infant’s brain becomes attuned to the relevant correlations in sensory data, learning to group whiskers with ears rather than the flowers behind. Hinton tried to train deep neural networks to do this using a simple learning rule that he and the neuroscientist Terry Sejnowski had come up with in the 1980s. When sounds or images were fed into the bottom layer of a deep neural network, the data set off a cascade of firing activity. The firing of one virtual neuron could trigger a connected neuron in an adjacent layer to fire, too, depending on the strength of the connection between them. The connections were initially assigned a random distribution of strengths, but when two neurons fired together in response to data, Hinton and Sejnowski’s algorithm dictated that their connection should strengthen, boosting the chance that the connection would continue to successfully transmit signals. Conversely, little-used connections were weakened. As more images or sounds were processed, their patterns gradually wore ruts in the network, like systems of tributaries trickling upward through the layers. In theory, the tributaries would converge on a handful of top-layer neurons, which would represent sound or object categories.
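The “fired together, strengthened together” rule described above can be sketched in a few lines. This is a minimal Hebbian-style update, not Hinton and Sejnowski’s actual algorithm; the learning rate and the decay constant that weakens little-used connections are illustrative assumptions.

```python
import numpy as np

def hebbian_update(weights, pre, post, lr=0.01, decay=0.001):
    """Strengthen connections between co-firing neurons; weaken all
    connections slightly via decay, so little-used ones fade away.
    pre, post: binary firing vectors for two adjacent layers."""
    # Outer product: entry (i, j) is 1 only when post-neuron i and
    # pre-neuron j fired together in response to the same input.
    coactivation = np.outer(post, pre)
    return weights + lr * coactivation - decay * weights

# One update: a 3-neuron layer feeding a 2-neuron layer.
w = np.zeros((2, 3))
pre = np.array([1, 0, 1])   # two input-layer neurons fire
post = np.array([1, 0])     # one output-layer neuron fires
w = hebbian_update(w, pre, post)
```

Repeated over many inputs, updates like this wear the “ruts” in the network that the paragraph above describes.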

The problem was that data took too long to blaze trails all the way from the bottom network layer to the categories at the top. The algorithm wasn’t efficient enough.

Then, in 2005, Hinton and colleagues devised a new training regimen inspired by an aspect of brain development that he first learned about as a Cambridge University student in the 1960s. In dissections of cat brains, the biologist Colin Blakemore had discovered that the visual cortex develops in stages, tweaking its connections in response to sensory data one layer at a time, starting with the retina.

To replicate the visual cortex’s step-by-step development, Hinton ran the learning algorithm on his network one layer at a time, training each layer’s connections before using its output — a broader-brush representation of the original data — as the input for training the layer above, and then fine-tuned the network as a whole. The learning process became dramatically more efficient. Soon, deep learning was shattering accuracy records in image and speech recognition. Entire research programs devoted to it have sprung up at Google, Facebook and Microsoft.
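The layer-by-layer scheme can be sketched as follows. As a simplifying assumption, a PCA projection stands in for the restricted Boltzmann machines Hinton actually used as the per-layer learner; only the bottom-up training loop mirrors the protocol described above.

```python
import numpy as np

def train_layer(data, n_hidden):
    """Unsupervised training of one layer (PCA here, as a stand-in for
    an RBM): the layer's output is a broader-brush summary of its input."""
    centered = data - data.mean(axis=0)
    # Principal directions = top right-singular vectors of the data.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    w = vt[:n_hidden].T                  # weights: (n_in, n_hidden)
    return w, centered @ w               # output feeds the next layer

def greedy_pretrain(data, layer_sizes):
    """Train layers bottom-up, each on the previous layer's output."""
    weights, x = [], data
    for n_hidden in layer_sizes:
        w, x = train_layer(x, n_hidden)
        weights.append(w)
    return weights

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 32))           # toy "sensory" data
ws = greedy_pretrain(x, [16, 8, 4])      # 32 -> 16 -> 8 -> 4 units
```

In the full protocol, a fine-tuning pass over the whole stack would follow this greedy stage.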

Courtesy of David Schwab

David Schwab, an assistant professor of physics at Northwestern University.

“In the hands of Hinton [and others], these deep neural networks became the best classifiers around,” said Naftali Tishby, a computational neuroscientist and computer scientist at Hebrew University of Jerusalem. “This was very frustrating for the theoreticians in machine learning because they didn’t understand why it works so well.”

Deep learning worked in large part because the brain works. The analogy is far from perfect; cortical layers are more complicated than artificial ones, with their own internal networks humming away at unknown algorithms, and deep learning has branched off in directions of its own in the years since Hinton’s breakthrough, employing biologically implausible algorithms for many learning tasks. But Hinton, who now splits his time between the University of Toronto and Google, considers one principle to be key to both machine and biological learning: “You first learn simple features and then based on those you learn more complicated features, and it goes in stages.”

Quarks to Tables

In 2010, Schwab, then a postdoctoral researcher in biophysics at Princeton University, rode the train into New York City to hear Hinton lecture about deep learning. Hinton’s layer-by-layer training procedure immediately reminded him of a technique that is used all over physics and which Schwab views as “sort of the embodiment of what physics is,” he said.

When he got back to Princeton, Schwab called up Mehta and asked if he thought deep learning sounded a lot like renormalization. The two had been friends and collaborators since meeting years earlier at a summer research program and frequently ran “crazy ideas” past each other. Mehta didn’t find this idea particularly crazy, and the two set to work trying to figure out whether their intuition was correct. “We called each other in the middle of the night and talked all the time,” Mehta said. “It was kind of our obsession.”

Renormalization is a systematic way of going from a microscopic to a macroscopic picture of a physical system, latching onto the elements that affect its large-scale behavior and averaging over the rest. Fortunately for physicists, most microscopic details don’t matter; describing a table doesn’t require knowing the interactions between all its subatomic quarks. But a suite of sophisticated approximation schemes is required to slide up the distance scales, dilating the relevant details and blurring out irrelevant ones along the way.

Mehta and Schwab’s breakthrough came over drinks at the Montreal Jazz Festival when they decided to focus on a procedure called variational or “block-spin” renormalization that the statistical physicist Leo Kadanoff invented in 1966. The block-spin method involves grouping components of a system into larger and larger blocks, each an average of the components within it. The approach works well for describing fractal-like objects, which look similar at all scales, at different levels of resolution; Kadanoff’s canonical example was the two-dimensional Ising model — a lattice of “spins,” or tiny magnets that point up or down. He showed that one could easily zoom out on the lattice by transforming from a description in terms of spins to one in terms of blocks of spins.
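Kadanoff’s block-spin step is simple to sketch for a two-dimensional Ising lattice: tile it with small blocks and replace each block by a single representative spin. The majority rule with ties broken toward +1 used here is one common illustrative choice, not necessarily the exact prescription of the 1966 paper.

```python
import numpy as np

def block_spin(lattice, b=2):
    """One block-spin renormalization step: group spins (+1/-1) into
    b-by-b blocks and replace each block by its majority spin."""
    n = lattice.shape[0]
    # Reshape so axes 1 and 3 run over spins inside each block.
    blocks = lattice.reshape(n // b, b, n // b, b).sum(axis=(1, 3))
    return np.where(blocks >= 0, 1, -1)   # ties broken toward +1

rng = np.random.default_rng(0)
spins = rng.choice([-1, 1], size=(8, 8))  # random 8x8 spin configuration
coarse = block_spin(spins)                # 4x4 lattice of block spins
coarser = block_spin(coarse)              # 2x2: two zoom-out steps
```

Each call halves the linear size of the lattice, performing the “zooming out” described above.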

Hoping to connect the approach to the hierarchical representation of data in deep learning, Schwab and Mehta hopscotched between Kadanoff’s old papers and a pair of highly cited 2006 papers by Hinton and colleagues detailing the first deep-learning protocol. Eventually, they saw how to map the mathematics of one procedure onto the other, proving that the two mechanisms for summarizing features of the world work essentially the same way.

Olena Shmahalo / Quanta Magazine

A technique invented by Leo Kadanoff in 1966 for describing a lattice of “spins” at different levels of resolution is equivalent to a modern deep learning protocol.

To illustrate the equivalence, Schwab and Mehta trained a four-layer neural network with 20,000 examples of the Ising model lattice. From one layer to the next, the neurons spontaneously came to represent bigger and bigger blocks of spins, summarizing the data using Kadanoff’s method. “It learns from the samples that it should block-renormalize,” Mehta said. “It was astounding to us that you don’t put that in by hand, and it learns.”
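The building block of the network in that experiment, a restricted Boltzmann machine trained with one-step contrastive divergence (CD-1), can be sketched as follows. The layer sizes, learning rate and the use of mean-field probabilities instead of sampled binary states are simplifying assumptions, not details from Mehta and Schwab’s paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(w, a, b, v0, lr=0.05):
    """One contrastive-divergence (CD-1) update of a binary RBM.
    w: (n_visible, n_hidden) weights; a, b: visible/hidden biases;
    v0: batch of binary training vectors (e.g. flattened spin lattices)."""
    h0 = sigmoid(v0 @ w + b)          # positive phase: data-driven hidden units
    v1 = sigmoid(h0 @ w.T + a)        # negative phase: one Gibbs step down...
    h1 = sigmoid(v1 @ w + b)          # ...and back up
    n = len(v0)
    w += lr * (v0.T @ h0 - v1.T @ h1) / n
    a += lr * (v0 - v1).mean(axis=0)
    b += lr * (h0 - h1).mean(axis=0)
    return w, a, b

# Toy run: 16 visible "spins" (coded 0/1), 4 hidden units.
rng = np.random.default_rng(0)
v = rng.integers(0, 2, size=(100, 16)).astype(float)
w = rng.normal(scale=0.01, size=(16, 4))
a, b = np.zeros(16), np.zeros(4)
for _ in range(20):
    w, a, b = cd1_step(w, a, b, v)
```

Stacking several such machines, each trained on the hidden activity of the one below, yields the deep network used in the experiment.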

A deep neural network might use a different, more flexible form of renormalization when confronted with a cat photo rather than a fractal-like lattice of magnets, but researchers conjecture that it likewise would move layer by layer from the scale of pixels to the scale of pets by teasing out and aggregating cat-relevant correlations in the data.

Summarizing the World

Researchers hope cross-fertilization between statistical physics and deep learning will yield new advances in both fields, but it is too soon to tell “what the killer app is going to be for either direction,” Schwab said.

Because deep learning tailors itself to the data at hand, researchers hope that it will prove useful for evaluating behaviors of systems that are too messy for conventional renormalization schemes, such as aggregates of cells or complex proteins. For these biological systems that lack symmetry and look nothing like a fractal, “none of the mechanical steps that we’ve developed in statistical physics work,” Nemenman said. “But we still know that there is a coarse-grained description because our own brain can operate in the real world. It wouldn’t be able to if the real world were not summarizable.”

Through deep learning, there is also the hope of a better theoretical understanding of human cognition. Vijay Balasubramanian, a physicist and neuroscientist at the University of Pennsylvania, said he and other experts who span his two fields have long noticed the conceptual similarity between renormalization and human perception. “The development in Pankaj and David’s paper might give us the tools to make that analogy precise,” Balasubramanian said.

For example, the finding appears to support the emerging hypothesis that parts of the brain operate at a “critical point,” where every neuron influences the network as a whole. In physics, renormalization is performed mathematically at the critical point of a physical system, explained Sejnowski, a professor at the Salk Institute for Biological Studies in La Jolla, Calif. “So the only way it could be relevant to the brain is if it is at the critical point.”

There may be an even deeper message in the new work. Tishby sees it as a hint that renormalization, deep learning and biological learning fall under the umbrella of a single idea in information theory. All the techniques aim to reduce redundancy in data. Step by step, they compress information to its essence, a final representation in which no bit is correlated with any other. Cats convey their presence in many ways, for example, but deep neural networks pool the different correlations and compress them into the form of a single neuron. “What the network is doing is squeezing information,” Tishby said. “It’s a bottleneck.”

By laying bare the mathematical steps by which information is stripped down to its minimal form, he said, “this paper really opens up a door to something very exciting.”

Editor’s Note: Pankaj Mehta receives funding from the Simons Foundation as a Simons Investigator.

This article was reprinted on Wired.com.

Reader Comments

  • This is exciting! Congratulations to Pankaj and David! This work seems to be of great significance in the fields of physics and information technology (not my fields), but I also see a correlation between “compressing information to its essence” and humanity’s search for truth, and how the processes echo each other. I believe this work may have ramifications in many areas of study. We truly have much for which to be grateful, and it is work of this caliber that reminds us of our place in the world.

  • Is this the same renormalisation that is used to cancel infinities in particle interactions (I ask as that technique is not without its critics)?

  • “Step by step, they compress information to its essence…” reminds me of Pablo Picasso’s Abstraction of the Bull. Note how he compresses what the viewer needs to recognize a bull into just a line drawing at the end:
    http://www.artyfactory.com/art_appreciation/animals_in_art/pablo_picasso.htm

  • On a similar note, physicists at CERN are working on what they call artificial retinas to recognize certain patterns in collision data at the LHC. They hope to incorporate these in their trigger systems in the future.

  • On topic: The “renormalization” technique that Kadanoff developed for Ising models is based on handling one parameter at a time, and in the example given these are binary, e.g., spin or magnetic moment. In the real world you would be “renormalizing” an unknown number of different, possibly multi-state traits of a pattern. While doing this you would want to eliminate unnecessary work by (1) collapsing highly correlated traits; (2) reducing to a minimum the states for each trait, i.e., maximizing the discriminatory power of cut-off points; and (3) including as many traits as possible that are essentially orthogonal to one another. Of course, what maximizes discriminatory power at one level might not at a higher scale. So this whole process would be massively recursive in real life. These kinds of recursive processes seem to be something that animal brains are much better at than any hardware or software algorithms yet developed. I wonder what might be the reason.

    Off topic: I remember attending a series of colloquia on system dynamics models applied to urban growth that Kadanoff hosted at Brown University back in my grad student days. I went on to get involved with some system dynamics models in my first job out of grad school.

  • Re: @Jediphone – Pablo Picasso’s Abstraction of the Bull. Very interesting analogy, except Picasso is extracting the essence of what a bull is to him, rather than its essence in the Platonic sense. The end product reminded me of cave paintings, which in turn reminded me of stick figures in general. A child will spontaneously reduce a scene to stick figures in her drawing. Just a few lines become universally recognizable as a human figure. It would be interesting to develop neural networks that would reduce a scene to a stick-figure representation as an intermediate step toward furthering our understanding of using and training such networks.

  • @George Taylor: I think the stick figure drawing style may be a cultural thing that kids learn. I’m not sure they would automatically do this without ever seeing such examples before. I mean they would surely simplify, but not necessarily in the typical stick-figure style.

  • The perspective in “…strong similarities between deep learning and biological learning suggest that the brain may also employ a form of renormalization to make sense of the world” may have the cart before the horse. Perhaps the brain’s functioning led to the realization (one morning, while awakening from a nice sleep) of a deep-learning protocol.

  • Why don’t we simply explain these accomplishments in terms of the computer programs involved, without comparing them to a ‘mind’ of which we know almost nothing? We are not talking about the mind; we are talking about a machine. It seems absurd to compare something of which we know everything down to the last ‘bit’ to something of which we know practically nothing.

  • “Is this the same renormalisation that is used to cancel infinities in particle interactions (I ask as that technique is not without its critics)?”

    No it is not. The technique you are referring to is called regularization.

  • @simeon and jesse73:

    Regularization and renormalization are related procedures and both are used to deal with infinities in particle physics. In that context, a theory is renormalized to different energy scales rather than length scales, but the methods used are part of the same class of ideas as the technique described in this article. Here’s a resource: http://www.lptmc.jussieu.fr/user/lesne/RG-WS-Lesne-v3-FinalVersion.pdf.

    Regularization is the subset of renormalization techniques that deals with infinities, whereas the other parts are more generally focused on the similarity of quantities at different scales. I think this is a common viewpoint, which coincidentally is the point made in the second sentence of the second paragraph on renormalization.

    http://en.wikipedia.org/wiki/Renormalization

    “Renormalization was first developed in quantum electrodynamics (QED) to make sense of infinite integrals in perturbation theory. Initially viewed as a suspect provisional procedure even by some of its originators, renormalization eventually was embraced as an important and self-consistent actual mechanism of scale physics in several fields of physics and mathematics. Today, the point of view has shifted: on the basis of the breakthrough renormalization group insights of Kenneth Wilson, the focus is on variation of physical quantities across contiguous scales, while distant scales are related to each other through “effective” descriptions. “

  • “Maybe there is some universal logic to how you can pick out relevant features from data,” said Mehta. “I would say this is a hint that maybe something like that exists.” Sure there is, but it is not renormalization. It has been discovered already and it is called Causal Mathematical Logic. It follows directly from the fundamental principle of causality.

  • “This was very frustrating for the theoreticians in machine learning because they didn’t understand why it works so well.”

    Do they know now? Any comments or references?

  • I just spent an hour reading the Mehta-Schwab paper from beginning to end. Let me say that “A Common Logic to Seeing Cats and Cosmos” is a sensationalist article about a trivial paper, which will have no impact whatsoever. The whole M-S paper is based on the fact that couplings of two systems appear in more than one context and that distributions can sometimes appear as marginal distributions on product spaces. There is no one-to-one mapping between the renormalization group (RG) scheme of Kadanoff and Restricted Boltzmann Machines (RBMs) in Deep Neural Networks (DNNs) in their paper. What they show is that an RBM can be represented as an RG scheme with a very specific choice of coupling function T in equation (18). Conveniently, this coupling function depends on the Hamiltonian of the spin system, which it normally should not. The equivalence in equations (8) and (9) is also not correct. Condition (9) of course implies that the scheme is exact, but not the other way around, unless the authors make some implicit assumptions about the coupling function T not mentioned in the paper. The paper contains no non-trivial ideas, it does not “open up a door to something very exciting,” and I will not hold my breath expecting new breakthroughs because of this connection.

  • Thank you for this article – very interesting, although I haven’t fully grasped what this new paper is about.

    However, I don’t agree with the idea that deep learning is doing some sort of compression, at least not in the normal engineering sense of that word. Compression is important for the primate visual system in getting signals from the retina to visual cortex. The optic nerve is clearly a bottleneck in the pathway, and some sort of compression must occur. For humans, there are ~6 million cone photoreceptors and ~120 million rod photoreceptors in each eye, but only ~1.5 million axons in the optic nerve. I’m not sure about the LGN, but the cortical representation is then expanded in V1 to ~100 million neurons in each hemisphere. Retinal processing necessarily compresses the representation of an image from the full high-dimensional photoreceptor space.

    A simple way redundancy in the optic nerve signal could be reduced is through “decorrelating” the responses of the retinal ganglion cells that compose the optic nerve. This idea has been around since Shannon and Attneave, but recent experimental results show that there are at least some correlations between pairs of RGCs. The theoretical picture is greatly complicated by the newfound diversity in types of RGCs and how they share photoreceptor inputs. Obviously, noise in the receptors and post-receptor retinal circuitry is a factor as well that may necessitate correlations in RGC firing in order to perform something like error correction. It may also be true that RGCs of each type actually are decorrelated from one another. All of this is just to say that compression really only occurs from the retina to V1.
    http://www.cnbc.cmu.edu/cns/papers/Puchalla-Schneidman-Harris-Berry-NE05.pdf
    Deep networks for object recognition are thought to be very high-level simulations of V1, V2, V4 and IT cortex, so they’re starting beyond any straightforward retinal compression (although they are usually trained on decorrelated/compressed images).

    Deep networks learn synaptic connections that associate certain high-order correlations between image features while ignoring other high-order correlations. This means learning both to be “selective” for certain visual features (high-order correlations), like eyes positioned above a mouth, and to be “invariant” to others, like the scale or position of those same features in the visual field. This is much like the argument here:
    http://www.rowland.harvard.edu/rjf/cox/pdfs/TICS_DiCarloCox_2007.pdf

  • Most Quanta articles are “sensationalizing.” Does anyone remember “the jewel at the heart of quantum mechanics”? What does the Simons Foundation hope to achieve by this?

    Suppose you generated interest and some souls spent their lives on these topics, only to find out that it wasn’t as “exciting” as it was portrayed to be. Wouldn’t their disillusionment cause more damage than good?

  • This article highlights what is wrong with the recent trend that every physics department must have some sort of biophysics. This is neither good physics nor good biology (or its simulation).

  • Watch this related video:
    What the Brain Can Tell Us (Nov. 2014) (by Jeff Hawkins, co-founder of Numenta.)
    https://www.youtube.com/watch?v=0SroCjwkSFc
