A Surprise Source of Life’s Code

Emerging data suggests the seemingly impossible — that mysterious new genes arise from “junk” DNA.


Skip Sterling for Quanta Magazine

Genes, like people, have families — lineages that stretch back through time, all the way to a founding member. That ancestor multiplied and spread, morphing a bit with each new iteration.

For most of the last 40 years, scientists thought that this was the primary way new genes were born — they simply arose from copies of existing genes. The old version went on doing its job, and the new copy became free to evolve novel functions.

Certain genes, however, seem to defy that origin story. They have no known relatives, and they bear no resemblance to any other gene. They’re the molecular equivalent of a mysterious beast discovered in the depths of a remote rainforest, a biological enigma seemingly unrelated to anything else on Earth.

The mystery of where these orphan genes came from has puzzled scientists for decades. But in the past few years, a once-heretical explanation has quickly gained momentum — that many of these orphans arose out of so-called junk DNA, or non-coding DNA, the mysterious stretches of DNA between genes. “Genetic function somehow springs into existence,” said David Begun, a biologist at the University of California, Davis.

Olena Shmahalo/Quanta Magazine; source: Tautz and Domazet-Lošo, _Nature Reviews Genetics_, 2011.

New genes appear to burst into existence at various points along the evolutionary history of the mouse lineage (red line). The surge around 800 million years ago corresponds to the time when Earth emerged from its “snowball” phase, when the planet was almost completely frozen. The very recent peak represents newly born genes, many of which will subsequently be lost. If all genes arose via duplication, they all would have been generated soon after the origins of life, roughly 3.8 billion years ago (green line).

This metamorphosis was once considered to be impossible, but a growing number of examples in organisms ranging from yeast and flies to mice and humans has convinced most of the field that these de novo genes exist. Some scientists say they may even be common. Just last month, research presented at the Society for Molecular Biology and Evolution conference in Vienna identified 600 potentially new human genes. “The existence of de novo genes was supposed to be a rare thing,” said Mar Albà, an evolutionary biologist at the Hospital del Mar Research Institute in Barcelona, who presented the research. “But people have started seeing it more and more.”

Researchers are beginning to understand that de novo genes seem to make up a significant part of the genome, yet scientists have little idea of how many there are or what they do. What’s more, mutations in these genes can trigger catastrophic failures. “It seems like these novel genes are often the most important ones,” said Erich Bornberg-Bauer, a bioinformatician at the University of Münster in Germany.

The Orphan Chase

The standard gene duplication model explains many of the thousands of known gene families, but it has limitations. It implies that most gene innovation would have occurred very early in life’s history. According to this model, the earliest biological molecules 3.5 billion years ago would have created a set of genetic building blocks. Each new iteration of life would then be limited to tweaking those building blocks.

Yet if life’s toolkit is so limited, how could evolution generate the vast menagerie we see on Earth today? “If new parts only come from old parts, we would not be able to explain fundamental changes in development,” Bornberg-Bauer said.

The first evidence that a strict duplication model might not suffice came in the 1990s, when DNA sequencing technologies took hold. Researchers analyzing the yeast genome found that a third of the organism’s genes had no similarity to known genes in other organisms. At the time, many scientists assumed that these orphans belonged to families that just hadn’t been discovered yet. But that assumption hasn’t proven true. Over the last decade, scientists sequenced DNA from thousands of diverse organisms, yet many orphan genes still defy classification. Their origins remain a mystery.

In 2006, Begun found some of the first evidence that genes could indeed pop into existence from noncoding DNA. He compared gene sequences from the standard laboratory fruit fly, Drosophila melanogaster, with other closely related fruit fly species. The different flies share the vast majority of their genomes. But Begun and collaborators found several genes that were present in only one or two species and not others, suggesting that these genes weren’t the progeny of existing ancestors. Begun proposed instead that random sequences of junk DNA in the fruit fly genome could mutate into functioning genes.

Courtesy of Diethard Tautz

Diethard Tautz, a biologist at the Max Planck Institute for Evolutionary Biology, once doubted whether de novo genes could exist. He now thinks they may actually be quite common.

Yet creating a gene from a random DNA sequence appears as likely as dumping a jar of Scrabble tiles onto the floor and expecting the letters to spell out a coherent sentence. The junk DNA must accumulate mutations that allow it to be read by the cell or converted into RNA, as well as regulatory components that signify when and where the gene should be active. And like a sentence, the gene must have a beginning and an end — short codes that signal its start and end.
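
To put a rough number on the start-and-stop requirement, here is a minimal Python sketch. It assumes a deliberately simple null model that is not taken from any of the researchers quoted here: all four bases are equally likely and independent, and the three stop codons of the standard genetic code (TAA, TAG, TGA) are the only signals that end a reading frame. The function names are hypothetical. The sketch computes the chance that a random stretch of n codons contains no stop codon, then checks that figure against a small seeded simulation.

```python
import random

STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code


def prob_stop_free(n_codons):
    """Chance that n random codons contain no stop codon.

    Assumption: uniform, independent bases, so each codon is a stop
    with probability 3/64. This is an illustrative null model only.
    """
    return (61 / 64) ** n_codons


def estimate_stop_free(n_codons, trials=20000, seed=0):
    """Monte Carlo estimate of the same probability, for comparison."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    hits = 0
    for _ in range(trials):
        # Lazily draw n random codons; any() stops at the first stop codon.
        codons = ("".join(rng.choice("ACGT") for _ in range(3))
                  for _ in range(n_codons))
        if not any(codon in STOP_CODONS for codon in codons):
            hits += 1
    return hits / trials
```

Under these assumptions, prob_stop_free(n) shrinks geometrically as n grows, so long uninterrupted stretches are rare in random sequence while short ones are not. Real genomes have skewed base composition and other constraints, so the numbers are only a rough guide.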

In addition, the RNA or protein produced by the gene must be useful. Newly born genes could prove toxic, producing harmful proteins like those that clump together in the brains of Alzheimer’s patients. “Proteins have a strong tendency to misfold and cause havoc,” said Joanna Masel, a biologist at the University of Arizona in Tucson. “It’s hard to see how to get a new protein out of random sequence when you expect random sequences to cause so much trouble.” Masel is studying ways that evolution might work around this problem.

Another challenge for Begun’s hypothesis is that it’s very difficult to distinguish a true de novo gene from one that has changed drastically from its ancestors. (The difficulty of identifying true de novo genes remains a source of contention in the field.)

Ten years ago, Diethard Tautz, a biologist at the Max Planck Institute for Evolutionary Biology, was one of many researchers who were skeptical of Begun’s idea. Tautz had found alternative explanations for orphan genes. Some mystery genes had evolved very quickly, rendering their ancestry unrecognizable. Other genes were created by reshuffling fragments of existing genes.

Then his team came across the Pldi gene, which they named after the German soccer player Lukas Podolski. The sequence is present in mice, rats and humans. In the latter two species, it remains silent, which means it’s not converted into RNA or protein. The DNA is active, or transcribed into RNA, only in mice, where it appears to be important: mice without it have slower sperm and smaller testicles.

The researchers were able to trace the series of mutations that converted the silent piece of noncoding DNA into an active gene. That work showed that the new gene is truly de novo and ruled out the alternative — that it belonged to an existing gene family and simply evolved beyond recognition. “That’s when I thought, OK, it must be possible,” Tautz said.

A Wave of New Genes

Scientists have now catalogued a number of clear examples of de novo genes: A gene in yeast that determines whether it will reproduce sexually or asexually, a gene in flies and other two-winged insects that became essential for flight, and some genes found only in humans whose function remains tantalizingly unclear.

The Odds of Becoming a Gene

Scientists are testing computational approaches to determine how often random DNA sequences can be mutated into functional genes. Victor Luria, a researcher at Harvard, created a model using common estimates of the rates of mutation, recombination (another way of mixing up DNA) and natural selection. After subjecting a stretch of DNA as long as the human genome to mutation and recombination for 100 million generations, some random stretches of DNA evolved into active genes. If he were to add in natural selection, a genome of that size could generate hundreds or even thousands of new genes.
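
Luria’s model is not described in detail here, so the following is not his method. It is a toy Python sketch under loudly simplified assumptions: point mutations only, no recombination, no natural selection, uniform base frequencies, a made-up mutation rate, and “gene-like” reduced to the longest forward-strand open reading frame (an ATG followed in frame by a stop codon). All function names and parameter values are hypothetical. It shows only the general shape of such an experiment: start with random DNA, mutate it for many generations, and record how the longest open reading frame changes.

```python
import random

STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code


def longest_orf_codons(seq):
    """Length in codons (ATG included, stop excluded) of the longest
    forward-strand ORF. Frames that never reach a stop are not counted."""
    best = 0
    for frame in range(3):
        run = None  # None means we are not currently inside an ORF
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if run is None:
                if codon == "ATG":
                    run = 1
            elif codon in STOP_CODONS:
                best = max(best, run)
                run = None
            else:
                run += 1
    return best


def mutate(seq, rate, rng):
    """Apply independent point mutations at the given per-base rate."""
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


def simulate(length=3000, generations=200, rate=1e-3, seed=0):
    """Mutate random DNA and track the longest ORF in each generation.

    The rate and sizes are illustrative placeholders, not measured values.
    """
    rng = random.Random(seed)
    seq = "".join(rng.choice("ACGT") for _ in range(length))
    history = [longest_orf_codons(seq)]
    for _ in range(generations):
        seq = mutate(seq, rate, rng)
        history.append(longest_orf_codons(seq))
    return history
```

Calling simulate() returns one value for the starting sequence plus one per generation. A realistic model would need measured rates, recombination, selection and far longer sequences than this sketch handles.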

At the Society for Molecular Biology and Evolution conference last month, Albà and collaborators identified hundreds of putative de novo genes in humans and chimps — ten-fold more than previous studies — using powerful new techniques for analyzing RNA. Of the 600 human-specific genes that Albà’s team found, 80 percent are entirely new, having never been identified before.

Unfortunately, deciphering the function of de novo genes is far more difficult than identifying them. But at least some of them aren’t doing the genetic equivalent of twiddling their thumbs. Evidence suggests that a portion of de novo genes quickly become essential. About 20 percent of new genes in fruit flies appear to be required for survival. And many others show signs of natural selection, evidence that they are doing something useful for the organism.

In humans, at least one de novo gene is active in the brain, leading some scientists to speculate such genes may have helped drive the brain’s evolution. Others are linked to cancer when mutated, suggesting they have an important function in the cell. “The fact that being misregulated can have such devastating consequences implies that the normal function is important or powerful,” said Aoife McLysaght, a geneticist at Trinity College in Dublin who identified the first human de novo genes.

Promiscuous Proteins

De novo genes are also part of a larger shift, a change in our conception of what proteins look like and how they work. De novo genes are often short, and they produce small proteins. Rather than folding into a precise structure — the conventional notion of how a protein behaves — de novo proteins have a more disordered architecture. That makes them a bit floppy, allowing the protein to bind to a broader array of molecules. In biochemistry parlance, these young proteins are promiscuous.

Scientists don’t yet know a lot about how these shorter proteins behave, largely because standard screening technologies tend to ignore them. Most methods for detecting genes and their corresponding proteins pick out long sequences with some similarity to existing genes. “It’s easy to miss these,” Begun said.

That’s starting to change. As scientists recognize the importance of shorter proteins, they are implementing new gene discovery technologies. As a result, the number of de novo genes might explode. “We don’t know what things shorter genes do,” Masel said. “We have a lot to learn about their role in biology.”

Scientists also want to understand how de novo genes get incorporated into the complex network of reactions that drive the cell, a particularly puzzling problem. It’s as if a bicycle spontaneously grew a new part and rapidly incorporated it into its machinery, even though the bike was working fine without it. “The question is fascinating but completely unknown,” Begun said. 

A human-specific gene called ESRG illustrates this mystery particularly well. Some of the sequence is found in monkeys and other primates. But it is only active in humans, where it is essential for maintaining the earliest embryonic stem cells. And yet monkeys and chimps are perfectly good at making embryonic stem cells without it. “It’s a human-specific gene performing a function that must predate the gene, because other organisms have these stem cells as well,” McLysaght said.

“How does a novel gene become functional? How does it get incorporated into actual cellular processes?” McLysaght said. “To me, that’s the most important question at the moment.”

Reader Comments

  • That is not a graph of the Drumbeat of New Genes, it is a graph of the Drumbeat of SUCCESSFUL New Genes, i.e., genes that have survived to the present time. It is more likely that to a first order, the rate of mutations per unit time is proportional to the number of genes in existence near the surface of the earth, and to second order, environmental factors such as the amount of solar radiation that reaches the earth's surface. The question raised by the graph is what factors caused the success of these mutations to vary over time.

  • I have heard that one source of "junk DNA" is from viral infections. Where that is the case, we are talking about genes which had a function somewhere else, either in the virus itself or in a previously infected organism. DNA of this origin may have some portions which are partial genes but some of the material would be genes transported to the new host made "out of whole cloth" and might become functional under some circumstances.

  • The phenomenon of horizontal gene transfer merits at least a mention when discussing orphan genes as a possible origin for the existence of novel, functional genes unattributable to gene duplication events.

  • The credibility of this article is suspect considering an already known source of de-novo genes, viruses, is completely ignored. New hypotheses are always welcome, but they should be presented in the context of known theories.

  • The graph is fascinating, showing a peak c. 100 mya that would be the demise of the dinosaurs and rise of the mammals, another, lower peak at c. 350 mya that would be the Permian extinction event, and a peak c. 600 mya that would be the Cambrian explosion when multicellular organisms started to grow legs and eyes, move around (and eat the ones that didn't move). Then there is the peak that the article points out around 750 mya that ties in nicely with the thawing of snowball Earth.

  • Thank you for your very interesting article. In a sidebar, you mentioned that Victor Luria from Harvard has created a model to estimate the frequency of de novo gene creation. After a careful internet search, I have not been able to find his model. Has it been published? Do you have the journal citation?
    Again, thank you for your excellent article.

  • Thanks for your comment. That research hasn't yet been published. Luria presented his work at the Society for Molecular Biology and Evolution conference in Vienna in July.

  • Thanks for your comment. I asked researcher Diethard Tautz about horizontal gene transfer and de novo genes.
    Tautz: Yes, all the considerations around orphan genes have always also considered the possibility of horizontal gene transfer. There was originally indeed uncertainty about the fraction of orphans that might eventually be found to have originated via horizontal gene transfer. But we have now such a broad representation of sequenced taxa that one can usually trace the origins of every gene. For example, if a gene exists in a rare bacterial species and in mammals, it might previously (i.e. before finding the rare bacterium) have been classified as orphan in mammals, but would now be classified as having originated at the origin of life. However, this is a classical conservative explanation. The alternative would be that it was indeed a de novo gene in mammals that was horizontally transferred into the bacterial species. Hence, by taking the possibility of de novo gene evolution into account, one gets also a new interpretation framework for cases of horizontal gene transfer.

  • I am responding to Jim Beed and L Skeptic:
    Pieces of dead viruses (or other repetitive elements) can become incorporated into de novo genes, but if their respective coding parts would be incorporated, they would be classified as genes coming from these viruses or elements. Keep in mind that pieces of such DNA can be used as antisense as well as in different reading frames compared to the original sequence.

    Viruses are certainly not ignored, but can be excluded in most cases. Given that the identification of de novo evolved genes requires that one finds the corresponding piece of “junk” DNA in the outgroups, one can exclude the insertion of a viral sequence at the respective location. Viruses can of course be involved in mediating horizontal gene transfers, but then my previous response applies. Also, de novo genes can indeed evolve even within viruses, interestingly often by making use of an existing reading frame through overprinting (see e.g. Pavesi et al. Viral proteins originated de novo by overprinting can be identified by codon usage: application to the "gene nursery" of Deltaretroviruses. PLoS Comput Biol. 2013;9(8):e1003162), a fact that is often used as additional evidence for the power of de novo evolution.

  • Re: Viruses are certainly not ignored, but can be excluded in most cases.

    In a recent interview, Eugene Koonin made the opposite claim. See:

    Excerpt: "The entire evolution of the microbial world and the virus world, and the interaction between microbes and viruses and other life forms have been left out of the Modern Synthesis…"

  • "Then his team came across the Pldi gene, which they named after the German soccer player Lukas Podolski."- ?
    "On the basis of these characteristics, we have named the gene POLymorphic Derived Intron-containing (Poldi)"
    (Heinen TJ, Staubach F, Häming D, Tautz D. "Emergence of a new gene from an intergenic region." Curr Biol. 2009 Sep 29;19(18):1527-31. doi: 10.1016/j.cub.2009.07.049. Epub 2009 Sep 3.)

  • "Yet creating a gene from a random DNA sequence appears as likely as dumping a jar of Scrabble tiles onto the floor and expecting the letters to spell out a coherent sentence. The junk DNA must accumulate mutations that allow it to be read by the cell or converted into RNA, as well as regulatory components that signify when and where the gene should be active. And like a sentence, the gene must have a beginning and an end — short codes that signal its start and end."

    I'm not certain how important the phenomenon is. It seems hard to calculate the contributions of viruses and lateral transfer to these genes. But I'm also not certain whether the comparison to Scrabble tiles spelling out coherent sentences is very compelling. New start/stop codons that originate by mutation or frame shift are inevitable. The notion that transcription depends solely upon regulatory genes so far as I know is just wrong. Studies of gene expression find many, many short mRNAs etc. which I thought demonstrated that transcription was an inevitable process, that natural selection of regulatory genes is not so all powerful that only chosen exons are transcribed, with no "waste" of transcribing segments of base pairs without designation from the regulatory genes. It seems doubtful that regulatory genetic determinism is any more apt at panselectionist perfection than coding genetic determinism.

  • For those who complain because this article ignores horizontal transfer and viral sequences as the possible sources of new genes, I assume that the new genes referred to in the article correspond to those having sequences already in the "junk" DNA where they arose. These genes, of course, need to be subject to natural selection.

  • Hasn't it been long since time to drop the term 'junk' when describing DNA sequences of indeterminate function? Or is this just the 'click-bait' part of the article?

  • José Moreno's comment saying that de novo genes are those having sequences already in the junk DNA is correct. This means that we can find the corresponding genomic sequence, but not the expressed gene, in closely related species. The first papers published by Begun and Jones used this definition, and we are employing the same criteria to identify de novo genes in humans in our latest work.

  • The evidence for "junk" DNA is pretty strong–

  • Formerly termed "JUNK DNA" is not just dark. From the virus-first perspective all this "useless" DNA represents remnants of former viral infection events that now act in most cases as transcribed non-coding RNAs in gene regulation in all processes of life such as transcription, translation, recombination, repair and immunity. From a physico-chemical perspective de novo genes clearly derive from mutations, i.e. error replication events; from a biocommunicative perspective de novo genes are the result of natural genome editing competencies of persistent viruses that edit host genomes. This is not "error" but productivity.

  • So, "It’s as if a bicycle spontaneously grew a new part and rapidly incorporated it into its machinery, even though the bike was working fine without it." As far as I am aware a bicycle is not challenged by anti-bicycle viruses or other pathogens. It does not need an immune system. But we do not "work fine" without an immune system that acts both extracellularly (e.g. protein antibodies) and intracellularly (antibody-like RNAs and proteins).

    The view that "junk" DNA is the source of a wide spectrum of "antibody RNAs" that have the potential to hybridize with the nucleic acids introduced by intracellular pathogens has long been on the table. Sometimes, by mutation and subsequent selection at the RNA or protein level, these will be the source of new genes with other than immune function.
