"We all suspected there was interesting stuff going on in these regions [between genes], and sure enough there is," says bioinformatician Ewan Birney of the European Bioinformatics Institute near Cambridge, England, a member of the project's computer analysis team.
Although researchers do not yet know the biological significance of these discoveries, they say that fully cataloguing the genome may help them understand how genetic variations affect the risk of contracting diseases such as cancer as well as how humans grow from a single-celled embryo into an adult. The next phase of the project, set to begin later this year, will attempt to inventory the full genome.
A genome consists of only four different nucleotide bases, or DNA subunits, arranged in a particular sequence. The publication of the human genome in 2001 revealed that sequence, but its significance remains a mystery. In particular, genes account for only 1.2 percent of the genome's three billion bases. Researchers have found that some of the remaining, so-called noncoding regions, once dismissed as "junk DNA," are shared among mammals, suggesting they serve an important function.
To help uncover those functions and identify other important sequences, 35 research groups joined forces in 2003 to create the Encyclopedia of DNA Elements (ENCODE) project. This consortium selected 44 separate sections of the genome that included regions of high to low gene density and high to low similarity between mouse and human.
Like treasure hunters combing a vast beach with metal detectors, ENCODE researchers sifted through their patch of the genome in multiple ways that are described, along with the results, in a Nature paper published online today and in a special issue of Genome Research.
A major part of the project was identifying sequences that cells copy, or transcribe, into RNA molecules. Cells make proteins from RNA they copy from genes, but some RNAs play roles by themselves. In addition, some studies have found evidence that species from flies and worms to humans copy large amounts of RNA from noncoding DNA, with no apparent purpose. Nevertheless, "before ENCODE, I think a lot of people were skeptical of how real intergenic activity was," says bioinformatician and consortium member Mark Gerstein of Yale University.
Although genes make up only 3 percent of the ENCODE sequence, the consortium found that 93 percent of the sequence is transcribed. Some of the transcripts hail from noncoding DNA, the researchers report, and the transcripts that do match up with the 399 ENCODE genes overlap one another extensively.
Transcripts from 65 percent of the genes incorporate pieces of DNA from relatively far outside of the genes or even from one or two other genes, says molecular biologist and consortium member Tom Gingeras of Affymetrix, a genome technology company in Santa Clara, Calif. Researchers know that cells chop single genes into shorter pieces called exons, which they mix and match into one transcript for creating a protein. Gingeras says the ENCODE findings confirm recent reports that humans and flies sometimes combine exons from two different genes.
Based on the transcript sequences, the researchers identified 1,437 new promoters—short DNA sequences where transcription begins—in or between genes, on top of the 1,730 promoters they knew of. That is nearly ten promoters per gene, Birney says. He adds that the abundance of transcripts that overlap each gene suggests that the very term "gene" should mean something different inside the cell nucleus, where transcription takes place, than outside of it, where finished proteins go.
Project members also catalogued sequences that mark areas where DNA unwinds from the round histone proteins that maintain the shape of chromosomes, allowing the cell's transcription machinery to activate genes in those areas. They discovered some potentially unwound areas that are far from promoters and may therefore play some other role, Birney says.
The consortium found that 5 percent of the studied sequence has been conserved among 23 mammals, suggesting that it plays a role important enough for evolution to have preserved it as those species evolved. But of all the new ENCODE sequences identified as potentially important, only half fall into the conserved group.
These unconserved sequences may be "bystanders," Birney says—consequences of the genome's other functions—that neither help nor hurt cells and may have provided fodder for past evolution.
They could also simply maintain a useful DNA structure or spacing between pieces of DNA regardless of their particular sequence, says genomics researcher T. Ryan Gregory of the University of Guelph in Ontario, who was not part of the consortium.
"The biological insights are mainly incremental at this point," says genome biologist George Weinstock of the Baylor College of Medicine in Houston, which he says is to be expected of such a pilot study. "This is a 'community resource' project, like a genome project, that makes lots of new data available to the community, who then dig into it and mine it for discoveries."
Gregory says the results, although still cryptic, do hint at new functions and a more complicated genome. "This study shows us how far we are from a comprehensive understanding of the human genome."