In the 1970s, when biologists first glimpsed the landscape of human genes, they saw that the small pieces of DNA that coded for proteins (known as exons) seemed to float like bits of wood in a sea of genetic gibberish. What on earth were those billions of other letters of DNA there for? No less a molecular luminary than Francis Crick, co-discoverer of DNA’s double-helical structure, suspected it was “little better than junk.”
The phrase “junk DNA” has haunted human genetics ever since. In 2000, when scientists of the Human Genome Project presented the first rough draft of the sequence of bases, or code letters, in human DNA, the initial results appeared to confirm that the vast majority of the sequence—perhaps 97 percent of its 3.2 billion bases—had no apparent function. The “Book of Life,” in other words, looked like a heavily padded text.
Now, in a series of papers published in September in Nature (Scientific American is part of Nature Publishing Group) and elsewhere, the ENCODE group has produced a stunning inventory of previously hidden switches, signals and signposts embedded like runes throughout the entire length of human DNA. In the process, the ENCODE project is reinventing the vocabulary with which biologists study, discuss and understand human inheritance and disease.
Ewan Birney, 39, of the European Bioinformatics Institute in Cambridge, England, led the analysis by the more than 400 ENCODE scientists who annotated the genome. He recently spoke with Scientific American about the major findings. Excerpts follow.
Scientific American: The ENCODE project has revealed a landscape that is absolutely teeming with important genetic elements—a landscape that used to be dismissed as “junk DNA.” Were our old views of how the genome is organized too simplistic?
BIRNEY: People always knew there was more there than protein-coding genes. It was always clear that there was regulation. What we didn’t know was just how extensive this was.
Just to give you a sense here, about 1.2 percent of the bases are in protein-coding exons. And people speculated that “maybe there’s the same amount again involved in regulation or maybe a little bit more.” But even if we take quite a conservative view from our ENCODE data, we end up with something like 8 to 9 percent of the bases of the genome involved in doing something like regulation.
Thus, much more of the genome is devoted to regulating genes than to the protein-coding genes themselves?
And that 9 percent can’t be the whole story. The most aggressive view of the amount we’ve sampled is 50 percent. So certainly it’s going to go above 9 percent, and one could easily argue for something like 20 percent. That’s not an unfeasible number.
Should we be retiring the phrase “junk DNA” now?
Yes, I really think this phrase does need to be totally expunged from the lexicon. It was a slightly throwaway phrase to describe very interesting phenomena that were discovered in the 1970s. I am now convinced that it’s just not a very useful way of describing what’s going on.
What is one surprise you have had from the “junk”?
There has been a lot of debate, inside of ENCODE and outside of the project, about whether or not the results from our experiments describe something that is really going on in nature. And then there was a rather more philosophical question, which is whether it matters. In other words, these things may biochemically occur, but evolution, as it were, or our body doesn’t actually care.
That debate has been running since 2003. And then work by ourselves, but also work outside of the consortium, has made it much clearer that the evolutionary rules for regulatory elements are different from those for protein-coding elements. Basically, the regulatory elements turn over a lot faster. So if you find a particular protein-coding gene in a human, you’re going to find nearly the same gene in a mouse most of the time, whereas that rule just doesn’t work for regulatory elements.