Machine Intelligence Cracks Genetic Controls

Every recipe has both instructions and ingredients. So does the human genome. An error in the instructions can raise the risk for disease.

Olena Shmahalo/Quanta Magazine

Every recipe has both instructions and ingredients. So does the human genome. An error in the instructions can raise the risk for disease.


Every cell in your body reads the same genome, the DNA-encoded instruction set that builds proteins. But your cells couldn’t be more different. Neurons send electrical messages, liver cells break down chemicals, muscle cells move the body. How do cells employ the same basic set of genetic instructions to carry out their own specialized tasks? The answer lies in a complex, multilayered system that controls how proteins are made.

Most genetic research to date has focused on just 1 percent of the genome — the areas that code for proteins. But new research, published today in Science, provides an initial map for the sections of the genome that orchestrate this protein-building process. “It’s one thing to have the book — the big question is how you read the book,” said Brendan Frey, a computational biologist at the University of Toronto who led the new research.

Frey compares the genome to a recipe that a baker might use. All recipes include a list of ingredients — flour, eggs and butter, say — along with instructions for what to do with those ingredients. Inside a cell, the ingredients are the parts of the genome that code for proteins; surrounding them are the genome’s instructions for how to combine those ingredients.

Just as flour, eggs and butter can be transformed into hundreds of different baked goods, genetic components can be assembled into many different configurations. This process is called alternative splicing, and it’s how cells create such variety out of a single genetic code. Frey and his colleagues used a sophisticated form of machine learning to identify mutations in this instruction set and to predict what effects those mutations have.


Olena Shmahalo/Quanta Magazine

The researchers have already identified possible risk genes for autism and are working on a system to predict whether mutations in cancer-linked genes are harmful. I hope this paper will have a big impact on the field of human genetics by providing a tool that geneticists can use to identify variants of interest,” said Chris Burge, a computational biologist at the Massachusetts Institute of Technology who was not involved in the study.

But the real significance of the research may come from the new tools it provides for exploring vast sections of DNA that have been very difficult to interpret until now. Many human genetics studies have sequenced only the small part of the genome that produces proteins. “This makes an argument that the sequence of the whole genome is important too,” said Tom Cooper, a biologist at Baylor College of Medicine in Houston, Texas.

Reading the Recipe

The splicing code is just one part of the noncoding genome, the area that does not produce proteins. But it’s a very important one. Approximately 90 percent of genes undergo alternative splicing, and scientists estimate that variations in the splicing code make up anywhere between 10 and 50 percent of all disease-linked mutations. “When you have mutations in the regulatory code, things can go very wrong,” Frey said.

“People have historically focused on mutations in the protein-coding regions, to some degree because they have a much better handle on what these mutations do,” said Mark Gerstein, a bioinformatician at Yale University, who was not involved in the study. “As we gain a better understanding of [the DNA sequences] outside of the protein-coding regions, we’ll get a better sense of how important they are in terms of disease.”

Scientists have made some headway into understanding how the cell chooses a particular protein configuration, but much of the code that governs this process has remained an enigma. Frey’s team was able to decipher some of these regulatory regions in a paper published in 2010, identifying a rough code within the mouse genome that regulates splicing. Over the past four years, the quality of genetics data — particularly human data — has improved dramatically, and machine-learning techniques have become much more sophisticated, enabling Frey and his collaborators to predict how splicing is affected by specific mutations at many sites across the human genome. “Genome-wide data sets are finally able to enable predictions like this,” said Manolis Kellis, a computational biologist at MIT who was not involved in the study.

Courtesy of University of Toronto

Video: Brendan Frey discusses his team’s deep-learning approach to understand the genome, and why he believes “the technique we’ve discovered will transform medicine.”

Frey’s team used an approach called deep learning. Like any kind of machine-learning technique, the model tries to find a relationship between two sets of data. In this case, Frey’s team connected the human reference genome with rich data sets cataloging the amounts of different protein components in different tissues. (Just as two different cake recipes vary in their ratios of flour and sugar, brain cells and liver cells vary in how much of each protein they produce.) In essence, the algorithms trained a computational model to read instructions embedded in the DNA.

While scientists already knew how to read some aspects of the splicing code, the new model is unique. It allows scientists to predict how a wide array of genetic components will interact. “This group took what we knew about splicing and put it into a computational framework where we can weight all [the variables],” Burge said.

For example, researchers can use the model to predict what will happen to a protein when there’s a mistake in part of the regulatory code. Mutations in splicing instructions have already been linked to diseases such as spinal muscular atrophy, a leading cause of infant death, and some forms of colorectal cancer. In the new study, researchers used the trained model to analyze genetic data from people afflicted with some of those diseases. The scientists identified some known mutations linked to these maladies, verifying that the model works. They picked out some new candidate mutations as well, most notably for autism.

One of the benefits of the model, Frey said, is that it wasn’t trained using disease data, so it should work on any disease or trait of interest. The researchers plan to make the system publicly available, which means that scientists will be able to apply it to many more diseases.

A Broader Context

The model also reveals that when it comes to the genome, “context is important, just like in English,” Frey said. “‘Cat’ means different things whether we’re talking about pets or construction equipment.” In the same way, how the cell interprets a set of splicing instructions depends on other nearby instructions. A string of DNA that means “make lots of component X” might mean “don’t make component X” when it’s sitting near a second set of instructions. “Whether a sequence has an effect depends on whether another sequence has an effect,” Frey said. “Without understanding that, it’s hard to predict how a pattern will affect splicing.”

In addition, the model could help scientists reconsider known mutations, Burge said. Researchers already knew that some splicing instructions are found within protein-coding regions. In these cases, the same genetic sequence might code for both an ingredient and an instruction for what to do with it. (Consider whipped cream — it’s an ingredient, but it’s also in some ways an instruction.) A mutation in this protein-coding region might be dismissed as unimportant if it appears to do little or nothing to alter the corresponding protein. But when interpreted using the splicing code, that mutation might be found to have a profound effect by interfering with the splicing instructions. Frey’s group found many examples of these errors across the genome.

Frey hopes the model will ultimately prove useful for personalized medicine. For example, doctors cannot yet determine whether healthy people with novel mutations are predisposed to diseases like cancer. With further validation, Frey’s model might help to answer this question. “We can analyze any mutation, even those that haven’t been identified yet,” Frey said. This allows researchers to predict whether a novel mutation is likely to be dangerous or harmless — in essence, performing a screening test. “I want to see it have a huge impact on medicine,” he said. “I want to translate this into practice.”

This article was reprinted on

View Reader Comments (5)

Leave a Comment

Reader CommentsLeave a Comment

  • This is very interesting work! I have always expected that the so-called “junk” DNA did more than people realized. I wonder if Dr. Frey’s group can apply their insights to dementia. I’ll look forward to seeing what comes of this.

  • How does this research fit in with the epigenetic work that non encoding RNA that controls how some genes are expressed, or not?

  • @Michael: Good question. The actual discovery here, as I understand it from the article and video (sadly I don’t work in academia anymore so I don’t have access to the Science paper), is a mathematical model that allows prediction of the effect of mutations on splicing. That fits with regulation of gene expression by non-coding RNAs in 2 ways. 1) the mutations whose effect is described in the described here might be altering non-coding RNA sequence directly, or 2) presumably, the non-coding RNAs that affect splicing do it after recognizing their targets through base pairing. Some of the rules in the model will probably reflect the loss or gain of target recognition.

  • @Dani: It’s not a mathematical model, it’s a neural network that has been trained to recognize these splicing sequences. All though the “model” may contain rules, nobody knows what these are or why the network really works. It’s probably based on statistics so false “hits” must be anticipated. See and

  • In addition to changes in intronic regions as well as in protein coding regions that can adversely impact splicing, the importance of synonymous mutations (DNA changes that don’t change the amino acid but can have regulatory impact) cannot be overstated. Most of the ongoing vast genome sequencing efforts rarely report on the impact of synonymous changes and this machine learning attempt, despite the note on false hits, is an important step towards understanding the instructions of the code.

Comments are closed.