Reading the Book of Life

We have only about twice as many genes as a worm or fly, far fewer than anyone guessed. So now what?

Image: DOE HUMAN GENOME PROGRAM

GENES are encoded in DNA by four bases, the letters of the genetic alphabet (A,G,T,C), and can be very difficult to identify. Chromosomes, located in the cell's nucleus, contain the DNA.

Last summer the world celebrated when scientists from the Human Genome Project, an international consortium of academic research centers, and Celera Genomics, a private U.S. company, both announced that they had finished working drafts of the human genome. It was an important first step toward deciphering the entire genome, one of the greatest scientific undertakings of all time. But these drafts revealed only the beginning of the story: the scrolls containing the instructions for life. Now both teams have started reading, gene after gene, the actual scriptures within the scrolls. Today they will announce the results of their analyses, which will appear in separate papers in this week's Nature and Science.

Among other surprises, both papers agree that humans have a mere 26,000 to 40,000 genes, far fewer than many people predicted. For perspective, consider that the simple roundworm Caenorhabditis elegans has 18,000 genes; the fruit fly Drosophila melanogaster, 13,000. As of last summer, some estimated the human genome might include as many as 140,000 genes. It will be several more years before scientists agree on an absolute total, but most are confident that the final number won't fall outside the range reported today. "I wouldn't be shocked if it was 29,000 or 36,000," says Francis Collins, director of the National Human Genome Research Institute at the NIH. "But I would be shocked if it was 50,000 or 20,000."

An error margin of some 10,000 genes may not seem impressive after so many years of work, but genes, the actual units of DNA that encode RNA and proteins, are very difficult to count. For one thing, they are scattered throughout the genome like proverbial needles in a haystack: their coding parts constitute only about 1 to 1.5 percent of the roughly three billion base pairs in the human genome. The coding region of a gene is fragmented into little pieces, called exons, separated by long stretches of noncoding DNA, or introns. Only when messenger RNA is made during a process called transcription are the exons spliced together.
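To make the haystack concrete, here is a minimal Python sketch of the two ideas in the paragraph above: how many base pairs 1 to 1.5 percent of the genome amounts to, and what it means to splice exons together. The transcript, the exon coordinates and the function names are invented for illustration; they are not real human sequence or anything used by either sequencing team.

```python
# Illustrative sketch only: the toy transcript and exon coordinates are made up.
GENOME_SIZE_BP = 3_000_000_000  # "roughly three billion base pairs"


def coding_base_pairs(fraction):
    """Number of base pairs covered by a given coding fraction of the genome."""
    return round(GENOME_SIZE_BP * fraction)


def splice(pre_mrna, exons):
    """Join exon intervals (0-based, end-exclusive), dropping the introns between them."""
    return "".join(pre_mrna[start:end] for start, end in exons)


# Three short exons separated by two longer introns (hypothetical sequence).
toy_pre_mrna = "AUGGCC" + "GUAAGUUUUCAG" + "UUUGAA" + "GUAUGUCCCCAG" + "UGA"
toy_exons = [(0, 6), (18, 24), (36, 39)]

low = coding_base_pairs(0.01)    # 1 percent of the genome
high = coding_base_pairs(0.015)  # 1.5 percent of the genome
mature_mrna = splice(toy_pre_mrna, toy_exons)

print(f"Coding sequence: about {low:,} to {high:,} base pairs")
print(f"Spliced transcript: {mature_mrna}")
```

Even at the upper estimate, the coding portion comes to some 45 million base pairs out of three billion, which is why genes are so hard to spot by reading raw sequence.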

Image: DOE HUMAN GENOME PROGRAM

CLUES BY COMPARISON. The mouse genome can help scientists identify human genes because most mouse and human genes are very similar; their sequences are conserved in both genomes.

To identify functional genes, Collins explains, the scientists had to "depend upon a variety of bits of clues." Some clues come from comparisons with databases of complementary DNAs (cDNAs), which are exact copies of messenger RNAs. So, too, comparisons with the mouse genome help because most mouse and human genes are very similar; their sequences are conserved in both genomes, whereas a lot of the surrounding DNA is not. And when such clues aren't available, scientists rely exclusively on gene-predicting computer algorithms.

Because these algorithms are not totally reliable (sometimes they see a gene where there is none or miss one altogether), a few scientists doubt the new human gene count. For instance, William Haseltine of Human Genome Sciences, a company that specializes in finding protein-encoding genes only on the basis of cDNA, thinks that "the methods that have been used are very crude and inexact." He believes that there are more than twice as many genes as reported thus far by the two groups.

But many others do accept the current estimates and are asking what it means that humans should have so few genes. According to Craig Venter, president of Celera Genomics, "the small number of genes means that there is not a gene for each human trait, that these come at the protein level and at the complex cellular level." As it turns out, at least every third human gene makes several different proteins through "alternative splicing" of its pre-messenger RNA. Also, human proteins have a more complicated architecture than their worm and fly counterparts, adding another level of complexity. And compared with simpler organisms, humans possess extra proteins with functions in, for example, the immune system and the nervous system, as well as in blood clotting, cell signaling and development.

Scientists are also puzzling over the significance of the discovery that more than 200 genes from bacteria apparently invaded the human genome millions of years ago, becoming permanent additions. Today, the new work shows, some of these bacterial genes have taken over important human functions, such as regulating responses to stress. "This is kind of a shocker and will no doubt inspire some further study," Collins says. Indeed, scientists previously thought that this kind of horizontal gene transfer was not possible in vertebrates.

Another curious feature of the human genome is its overall landscape, in which gene-dense and gene-poor regions alternate. "There are these areas that look like urban areas with skyscrapers of gene sequences packed on top of each other," Collins explains, "and then there are these big deserts where there doesn't seem to be anything going on for millions of base pairs." Moreover, such differences are apparent not only within but also between chromosomes. Chromosome 19, for example, is about four times richer in genes than the Y chromosome.

So what's going on in gene deserts? More than half the human genome consists of repeat sequences, also known as "junk DNA" because they have no known function. Vertebrates can live well without them: the puffer fish, for example, has a genome with very few of these repeats. In humans, most of them derive from transposable elements, parasitic stretches of DNA that replicate and insert a copy of themselves at another site. But now almost all the different families of transposons seem to have stopped roaming the genome, and only their "fossils" remain. Still, nearly 50 genes appear to originate from transposons, suggesting they played some useful role during the genome's evolution.

Image: DOE HUMAN GENOME PROGRAM

HARDLY DONE. Only one billion base pairs (yellow, orange and blues, above), or a third of the total, in the public database are in a "finished" form.

One type of transposon, the so-called Alu element, is found especially often in regions rich in G and C bases. These areas also harbor many genes, and so Alus might somehow be beneficial around them. Overall, the human genome once seemed to be "a complex ecosystem, with all these different elements trying to proliferate," says Robert Waterston, director of the Genome Sequencing Center at Washington University in St. Louis, a member of the public consortium. Today the mutations these elements have accumulated provide an excellent molecular fossil record of the evolutionary history of humankind.

In addition to repeat sequences caused by transposons, large segments of the genome seem to have duplicated over time, both within and between chromosomes. This duplication, researchers say, allowed evolution to play with different genes without destroying their original function and probably led to the expansion of many gene families in humans.

Apart from the genome sequence, both the Human Genome Project and Celera have identified a multitude of base positions in the DNA that differ between individuals and are called single nucleotide polymorphisms, or SNPs (pronounced "snips"). The public consortium discovered 1.4 million SNPs, and Celera announced it had found 2.1 million of them. Scientists are hoping to learn from them how genes make people different and, in particular, why some are more susceptible to certain diseases than others. "It will certainly take us a long time to figure out what they all mean, if they all mean anything, but I think the process is already beginning," Waterston notes.
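In its simplest form, a SNP is a single position where two otherwise identical stretches of DNA carry different bases. The Python sketch below illustrates only that idea. The two sequences and the function name `find_snps` are hypothetical, and the sketch assumes the sequences are already aligned and of equal length. Real SNP discovery is far more involved: it must align many sequencing reads and separate true variation from sequencing errors.

```python
# Illustrative sketch only: toy sequences, assumed to be pre-aligned and equal in length.
def find_snps(seq_a, seq_b):
    """Return (position, base_a, base_b) for every position where the sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (position, base_a, base_b)
        for position, (base_a, base_b) in enumerate(zip(seq_a, seq_b))
        if base_a != base_b
    ]


# Two hypothetical individuals who differ at a single base (0-based position 5).
person_1 = "ATGCCGTAAGCT"
person_2 = "ATGCCATAAGCT"

snps = find_snps(person_1, person_2)
for position, base_1, base_2 in snps:
    print(f"Position {position}: {base_1} in person 1, {base_2} in person 2")
```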

To be sure, much work remains. Only one billion base pairs, a third of the total, in the public database are in a "finished" form, meaning they are highly accurate and without gaps. Both the Celera and the public data contain numerous gaps at the moment. In addition, large parts of the heterochromatin (a gene-poor, repeat-rich part of the DNA that accounts for about 10 percent of the genome) have yet to be cloned and sequenced. By the spring of 2003, the public project is hoping to finish that task, except for sequences that turn out to be impossible to obtain using current methods.

The next big challenge will be to find out how the genes interact in a cell. According to Collins, researchers will "begin to look at biology in a whole-genome way," studying, for example, the expression of all genes in a cell at a given time. Proteins, the products of the genes, will also be studied "not just one at a time, but tens of thousands at a time," Collins says, speaking of a fast-growing research field that goes by the name of proteomics. In the end, however, genes may provide only so many answers. "The basic message," Venter concludes, "is that humans are not hardwired. People who were looking for deterministic explanations for everything in their lives will be very disappointed, and people who are looking for the genome to absolve them of personal responsibility will be even more disappointed."
