The computational biologist Bruno Correia used to have a rule in his lab: No machine learning allowed. He didn’t consider it real science. Now Correia has used it to detect potential interactions between proteins — the complex folded molecules responsible for many biological processes — 40,000 times faster than conventional methods. The journal Nature Methods featured his system on its cover in February 2020. Correia said of his early reluctance to embrace machine learning, “I was wrong, and I’m glad I was wrong.”
What changed his mind? Geometric deep learning: an emerging subfield of artificial intelligence that can learn patterns on curved surfaces.
Proteins interact by fitting their bumpy, irregular shapes together like three-dimensional puzzle pieces. Researchers have spent decades trying to figure out how they do so. Work on the well-known protein folding problem, which has challenged scientists since the mid-20th century, attempts to understand protein interaction by decoding the link between a protein’s constituent amino acids and its final 3D shape. In 1999, IBM began developing its line of Blue Gene supercomputers to tackle the folding problem; 20 years later, DeepMind applied state-of-the-art deep learning algorithms to it.
Correia’s system, called MaSIF (short for molecular surface interaction fingerprinting), avoids the inherent complexity of a protein’s 3D shape by ignoring the molecules’ internal structure. Instead, the system scans the protein’s 2D surface for what the researchers call interaction fingerprints: features learned by a neural network that indicate that another protein could bind there. “The idea [is that when] any two molecules come together, what they’re essentially presenting to one another is that surface. So that’s all you need,” said Mohammed AlQuraishi, a protein researcher at Harvard Medical School who also uses deep learning. “It’s very, very innovative.”
MaSIF’s surface-focused framework for predicting protein interactions could help accelerate so-called de novo protein design, which tries to synthesize useful proteins from scratch rather than relying on the naturally occurring variety. But it could also be used for basic biology, said Michael Bronstein, a geometric deep learning expert at Imperial College London who helped develop the system. “How does cancer affect protein properties?” he said. “You can ask whether mutations as a result of cancer destroy something in the protein that makes them work in a different way, by not binding to what they are supposed to. [MaSIF] could answer fundamental questions.”
If you want to understand how deep learning can create protein fingerprints, Bronstein suggests looking at digital cameras from the early 2000s. Those models had face detection algorithms that did a relatively simple job. “You just need to detect that there is a face” — eyes, a nose, a mouth — “regardless of whether it has a long nose or a short nose, fat lips or thin lips,” he explained.
Modern cameras are more versatile. They can identify a particular person, allowing you to quickly search through your photo library to find all the photos they’re in.
This advance was made possible by deep neural networks, which gave computers a way to learn an individual’s subtle features from training data. The process involves feeding many instances of a particular face to the network and labeling them all as the same person. You don’t have to tell the computer in advance which exact mixture of attributes — green eyes, wide-set eyebrows, black hair — somehow adds up to your own face rather than another person’s. Instead, with enough properly labeled examples, the network learns the distinction itself.
MaSIF does the same thing for proteins. Previous approaches to interaction fingerprinting were like the basic face detection algorithms. They required researchers to define certain geometric patterns in advance — say, a bumpy patch on the surface of a protein with a specific shape and size — and then search for matches. MaSIF, by contrast, starts with a handful of basic surface features known to be associated with protein interactions: for instance, the surface’s physical curvature (into a knob or pocket), its electrical charge, and whether it repels or attracts water. Then, during training, the network learns how to combine these features into fingerprints that detect different higher-level patterns.
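The contrast can be made concrete with a toy sketch in Python. It is not MaSIF's actual architecture: the feature values, the weights, and the `fingerprint` function are all hypothetical, and in a real network the weights would be learned from training data rather than written by hand. The point is only that a fingerprint is a learned combination of the basic inputs, not a pattern specified in advance.

```python
import math

# Hypothetical per-point inputs for one surface patch: each tuple is
# (curvature, charge, hydropathy), scaled to roughly [-1, 1].
patch = [
    (0.8, -0.2, 0.5),
    (0.6, -0.1, 0.4),
    (0.7, -0.3, 0.6),
]

# Placeholder weights, one row per fingerprint dimension. In a trained
# network these numbers would come out of training, not be chosen by hand.
WEIGHTS = [
    (1.0, 0.0, -0.5),
    (0.2, 1.5, 0.3),
]

def fingerprint(points, weights):
    """Average each feature over the patch, then mix the averages with the weights."""
    n = len(points)
    mean = [sum(p[i] for p in points) / n for i in range(3)]
    # tanh squashes each weighted sum into the range (-1, 1).
    return [math.tanh(sum(w * m for w, m in zip(row, mean))) for row in weights]

print(fingerprint(patch, WEIGHTS))
```

Changing the weights changes which mixtures of curvature, charge and water affinity the fingerprint responds to, which is the part a handcrafted descriptor fixes ahead of time.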
Until recently, this kind of machine learning couldn’t be used on the curved, irregular surfaces of proteins. The rise of geometric deep learning opened up the possibility. Correia credits Bronstein with bringing the method to his attention during a two-week collaboration at Bronstein’s home in February 2018. “It was totally him,” said Correia, who’s based at the École Polytechnique Fédérale de Lausanne. “Our handcrafted descriptors were going nowhere.”
One version of the system, called MaSIF-site, can examine the whole surface of a protein and predict where another protein is most likely to bind, an approach similar to painting a target on a curved canvas. “It’s what we like to call the one-body problem,” Correia said. “You can think about this as a way to understand where the functional sites on a particular protein are.” MaSIF-site performed roughly 25% better at this task than two leading site-interaction predictors.
Another version of the system, called MaSIF-search, tackles what Correia calls the many-to-many problem: Instead of predicting how one protein will fit together with one target molecule (as typically happens in docking simulations), the system compares the interaction fingerprints of many proteins to many others, looking for fits. (“In a cell you have 10,000 proteins, and many of them are bumping into each other all the time,” explained Correia.) On this task, MaSIF didn’t outperform a leading molecular-docking predictor; it found roughly half as many potential fits within a random set of 100 proteins. But the docking predictor needed nearly 100 days’ worth of computing time to perform its search. MaSIF took four minutes.
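A many-to-many search of this kind can be sketched in a few lines of Python. The sketch assumes, purely for illustration, that each protein's fingerprint is a short vector of numbers and that fingerprints have been set up so that compatible partners land close together in Euclidean distance. The protein names, values and threshold are invented, and the real system's descriptors and matching criteria may differ.

```python
import math
from itertools import combinations

# Hypothetical fingerprints: one short vector per protein.
fingerprints = {
    "protein_A": [0.9, 0.1, -0.4],
    "protein_B": [0.8, 0.2, -0.5],
    "protein_C": [-0.7, 0.6, 0.3],
    "protein_D": [-0.6, 0.5, 0.4],
}

def candidate_fits(fps, threshold):
    """Compare every protein with every other; keep pairs whose fingerprints are close."""
    fits = []
    for (name_a, fp_a), (name_b, fp_b) in combinations(fps.items(), 2):
        distance = math.dist(fp_a, fp_b)  # Euclidean distance (Python 3.8+)
        if distance < threshold:
            fits.append((name_a, name_b, round(distance, 3)))
    # Closest pairs first.
    return sorted(fits, key=lambda fit: fit[2])

for fit in candidate_fits(fingerprints, threshold=0.5):
    print(fit)
```

Each comparison here is a single distance calculation rather than a docking simulation, which is one plausible reason a fingerprint search can be so much cheaper. Taking the figures above at face value, 100 days is 144,000 minutes, so a four-minute search is about 36,000 times faster.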
That massive speedup “opens interesting possibilities” for basic research, said Bronstein. After all, in the human body, proteins form functional networks comprising tens of thousands of interactions. “Constructing these graphs takes a lot of time,” Bronstein said. “With methods [like MaSIF], it may only be an approximation, but it allows you to at least build some rough version of these protein-to-protein networks for any organism.”
AlQuraishi noted that while MaSIF’s skin-deep approach to predicting protein interactions made sense, it wasn’t able to capture a phenomenon called induced fit: the way molecular surfaces change shape (and chemistry) when they get close to each other. In other words, the surfaces of two proteins may not exhibit complementary fingerprints until they’re already almost touching — a factor MaSIF will miss, since induced fit depends on the structure beneath a protein’s surface. “What evolution is probably optimizing for is precisely this induced fit,” said AlQuraishi. “What’s surprising about [MaSIF] is that even with this caveat, it still works pretty well.”
Incorporating induced fit and other surface dynamics into MaSIF is something Correia plans to explore. “To me it’s the last frontier of understanding [protein] function,” he said. “That’s probably how I’m going to be spending my next 10 years.” But at the moment he has other pressing business: using MaSIF to scan the spike-shaped proteins that stud the surface of SARS-CoV-2, the virus that causes COVID-19. “We are trying to see what fingerprints are in that virus,” he said. “It does seem like the virus has some places where we could try to attack it, besides the ones that we already knew.” Correia is already using this information about SARS-CoV-2 to synthesize antiviral proteins from scratch; he hopes to publish results this year. “If we could design new proteins based on the surface fingerprints of the viral protein in order to inhibit the way the virus invades host cells, that would be pretty exciting,” he said. “That’s what gets me out of bed.”