In the 1970s, when biologists first glimpsed the landscape of human genes, they saw that the small pieces of DNA that coded for proteins (known as exons) seemed to float like bits of wood in a sea of genetic gibberish. What on earth were those billions of other letters of DNA there for? No less a molecular luminary than Francis Crick, co-discoverer of DNA’s double-helical structure, suspected it was “little better than junk.”
The phrase “junk DNA” has haunted human genetics ever since. In 2000, when scientists of the Human Genome Project presented the first rough draft of the sequence of bases, or code letters, in human DNA, the initial results appeared to confirm that the vast majority of the sequence—perhaps 97 percent of its 3.2 billion bases—had no apparent function. The “Book of Life,” in other words, looked like a heavily padded text.
Now, in a series of papers published in September in Nature (Scientific American is part of Nature Publishing Group) and elsewhere, the ENCODE (Encyclopedia of DNA Elements) group has produced a stunning inventory of previously hidden switches, signals and signposts embedded like runes throughout the entire length of human DNA. In the process, the ENCODE project is reinventing the vocabulary with which biologists study, discuss and understand human inheritance and disease.
Ewan Birney, 39, of the European Bioinformatics Institute in Cambridge, England, led the analysis by the more than 400 ENCODE scientists who annotated the genome. He recently spoke with Scientific American about the major findings. Excerpts follow.
Scientific American: The ENCODE project has revealed a landscape that is absolutely teeming with important genetic elements—a landscape that used to be dismissed as “junk DNA.” Were our old views of how the genome is organized too simplistic?
BIRNEY: People always knew there was more there than protein-coding genes. It was always clear that there was regulation. What we didn’t know was just quite how extensive this was.
Just to give you a sense here, about 1.2 percent of the bases are in protein-coding exons. And people speculated that “maybe there’s the same amount again involved in regulation or maybe a little bit more.” But even if we take quite a conservative view from our ENCODE data, we end up with something like 8 to 9 percent of the bases of the genome involved in doing something like regulation.
Thus, much more of the genome is devoted to regulating genes than to the protein-coding genes themselves?
And that 9 percent can’t be the whole story. The most aggressive view of the amount we’ve sampled is 50 percent. So certainly it’s going to go above 9 percent, and one could easily argue for something like 20 percent. That’s not an unfeasible number.
Should we be retiring the phrase “junk DNA” now?
Yes, I really think this phrase does need to be totally expunged from the lexicon. It was a slightly throwaway phrase to describe very interesting phenomena that were discovered in the 1970s. I am now convinced that it’s just not a very useful way of describing what’s going on.
What is one surprise you have had from the “junk”?
There has been a lot of debate, inside of ENCODE and outside of the project, about whether or not the results from our experiments describe something that is really going on in nature. And then there was a rather more philosophical question, which is whether it matters. In other words, these things may biochemically occur, but evolution, as it were, or our body doesn’t actually care.
That debate has been running since 2003. And then work by ourselves, but also work outside of the consortium, has made it much clearer that the evolutionary rules for regulatory elements are different from those for protein-coding elements. Basically the regulatory elements turn over a lot faster. So whereas if you find a particular protein-coding gene in a human, you’re going to find nearly the same gene in a mouse most of the time, that rule just doesn’t work for regulatory elements.
In other words, there is more complex regulation of genes, and more rapid evolution of these regulatory elements, in humans?
That’s a rather different way of thinking about genes—and evolution.
I get this strong feeling that previously I was ignorant of my own ignorance, and now I understand my ignorance. It’s slightly depressing as you realize how ignorant you are. But this is progress. The first step in understanding these things is having a list of things that one has to understand, and that’s what we’ve got here.
Earlier studies suggested that only, say, 3 to 15 percent of the genome had functional significance—that is, actually did something, whether coding for proteins, regulating how the genes worked or doing something else. Am I right that the ENCODE data imply, instead, that as much as 80 percent of the genome may be functional?
One can use the ENCODE data and come up with a number between 9 and 80 percent, which is obviously a very big range. What’s going on there? Just to step back, the DNA inside of our cells is wrapped around various proteins, most of them histones, which generally work to keep everything kind of safe and happy. But there are other types of proteins called transcription factors, and they have specific interactions with DNA. A given transcription factor might bind at only 1,000 places, or the biggest might bind at 50,000 specific places across the genome. And so, when we talk about this 9 percent, we’re really talking about these very specific transcription-factor-to-DNA contacts.
On the other hand, the copying of DNA into RNA seems to happen all the time—about 80 percent of the genome is actually transcribed. And there is still a raging debate about whether this large amount of transcription is a background process that’s not terribly important or whether the RNA that is being made actually does something that we don’t yet know about.
Personally, I think everything that is being transcribed is worth further exploration, and that’s one of the tasks that we will have to tackle in the future.
There is a widespread perception that the attempts to identify common genetic variants related to human disease through so-called genome-wide association studies, or GWAS, have not revealed that much. Indeed, the ENCODE results now show that about 75 percent of the DNA regions that the GWAS have previously linked to disease lie nowhere near protein-coding genes. In terms of disease, have we been wrong to focus on mutations in protein-coding DNA?
Genome-wide association studies are very interesting, but they are not some magic bullet for medicine. The GWAS situation had everyone sort of scratching their heads. But when we put these genetic associations alongside the ENCODE data, we saw that although the loci are not close to a protein-coding gene, they really are close to one of these new elements that we’re discovering. That’s been a lovely thing. In fact, when I first saw it, it was a slightly too-good-to-be-true moment. And we spent a long time double-checking everything.
How does that discovery help us understand disease?
It’s like opening a door. Think about all the different ways you can study a particular disease, such as Crohn’s: Should we look at immune system cells in the gut? Or should we look at the neurons that fire to the gut? Or should we be looking at the stomach and how it does something else?
All those are options. Now suddenly ENCODE is letting you examine those options and say, “Well, I really think you should start by looking at this part of the immune system, the helper T cells, first.” And we can do that for a very, very big set of diseases. That’s really exciting.
Now that we are retiring the phrase “junk DNA,” is there another, better metaphor that might explain the emerging view of the genetic landscape?
What it feels like is genuinely a jungle—a completely dense jungle of stuff that you have to work your way through. You’re trying to hack your way to a certain position. And you’re really not sure where you are, you know? It’s quite easy to feel lost in there.
Over the past 20 years the public has been repeatedly told that these big genomic projects—starting with the Human Genome Project and going on through various other projects—were going to explain everything we needed to know about the “book of life.” Is ENCODE simply the latest in this sequence?
I think that each time we always said, “These are foundations. You build on them.” Nobody said, “Look, the human genome bases, that’s it. It’s all done and dusted—we’ve just got a bit of code breaking to do here.” Everybody said, “We’re going to be studying this for 50 years, 100 years. But this is the foundation that we start on.” I do get the feeling that the ENCODE project is the next layer in that foundational resource for other people to stand on top of and look further. The biggest change here is in our list of known unknowns. And I think people should understand that although finding out how much you don’t know can feel regressive and frustrating, identifying the gaps is really good.
Ten years ago we didn’t know what we didn’t know. There is no doubt that ENCODE poses many, many, many more questions than it directly answers. At the same time, for Crohn’s disease, say, and lots of other things, there are some effectively quick wins and low-hanging fruit—at least for researchers—where you start to say to people, “Oh my gosh, have you looked there?”
It’s just one more step. It’s an important step, but nowhere near the end, I’m afraid.
You sometimes refer to yourself as ENCODE’s “cat herder in chief.” How many people were involved in the consortium, and what was it like coordinating such a massive effort?
This is very much a different way of doing science. I am only one of 400 investigators, and I am the person who is charged with making sure that the analysis was delivered and that it all worked out. But I had to draw on the talents of many, many people.
So I’m more like the cat herder, the conductor, necessarily, than someone whose brain can absorb all of this. It comes back to that sense that it’s a bit of a jungle out there.
Well, you deserve a lot of credit. It’s more than just cats. They’re pretty opinionated cats.
Yeah, they are. What scientists are not are dogs. Dogs naturally run in packs. Cats? No. And I think that sums up the normal scientific phenotype. And so you have to cajole these people sometimes into sort of taking the same direction.
Do you see a point where all this complex information will resolve into a simpler message about human inheritance and human disease? Or do we have to accept the fact that complexity is, as it were, in our DNA?
We are complex creatures. We should expect that it’s complex out there. But I think we should be happy about that and maybe even proud about it.