Genome Databases Suffer from the Human Touch

Contamination of non-primate DNA archives with human sequences indicates that better screening is needed.

By Melissa Lee Phillips

Around a fifth of non-primate genome databases seem to be contaminated with human DNA sequences, according to a study.

Mark Longo, a geneticist at the University of Connecticut in Storrs, and his colleagues found that 18% of public databases of bacterial, plant and animal genome sequences contain stretches of human DNA, possibly as a result of researchers handling samples during sequencing. Their findings are published today in PLoS ONE.

David Haussler, a biomolecular engineer at the University of California, Santa Cruz, says that many genomics researchers are already aware of the presence of human DNA artefacts in genome assemblies, but this contamination has never been quantified.

Forensic scientists and researchers working with ancient DNA take extreme measures to avoid contamination, but most sequencing projects aren't so stringent. "It would be hugely expensive," says Haussler.

Filter failure

Longo and his colleagues decided to investigate genome-database contamination after discovering human sequences in a project on the genome of the zebrafish (Danio rerio). They scanned non-primate genome databases for genetic 'Alu' elements (short stretches of DNA, named after the Alu restriction enzyme that cuts them) that are abundant in the human genome and are known to be specific to primates.
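As a rough illustration of the idea of screening for a primate-specific marker, here is a minimal Python sketch. It is not the method used in the study: it relies on naive exact string matching, whereas real screens use sequence alignment. The marker string, the record names and the function names are all hypothetical placeholders, and the marker is not a real Alu sequence.

```python
# Hypothetical placeholder marker; NOT a real Alu consensus sequence.
MARKER = "AAACCCGGGTTA"

# Complement table for DNA bases; anything unexpected maps to N.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT.get(base, "N") for base in reversed(seq.upper()))


def flag_contaminated(records: dict[str, str], marker: str = MARKER) -> list[str]:
    """Return names of records containing the marker on either strand.

    Exact matching only: a single mismatch hides the marker, which is
    why this is a sketch rather than a usable contamination filter.
    """
    forward = marker.upper()
    reverse = reverse_complement(forward)
    flagged = []
    for name, seq in records.items():
        seq = seq.upper()
        if forward in seq or reverse in seq:
            flagged.append(name)
    return flagged


# Made-up example records (assumed data, for illustration only).
records = {
    "read_a": "GG" + "AAACCCGGGTTA" + "CC",   # marker on the forward strand
    "read_b": "GGGGGGGGGGGGGGGG",             # no marker
    "read_c": "tt" + "taacccgggttt" + "aa",   # marker on the reverse strand
}
flagged = flag_contaminated(records)
print(flagged)
```

Checking both strands matters because a contaminating fragment can be sequenced in either orientation.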

The researchers found human DNA sequences in 492 out of 2,749 sequencing archives that they checked. Contamination showed up in raw sequencing data and in final assemblies of data that had been pieced together by computers to compile a complete genome sequence.
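The counts above can be checked against the "18%" and "around a fifth" figures quoted earlier. A quick sketch of the arithmetic (the variable names are illustrative):

```python
# Counts reported in the article.
contaminated = 492   # archives containing human DNA sequences
checked = 2749       # archives examined

fraction = contaminated / checked
print(f"{fraction:.1%}")  # 17.9%
```

The result rounds to 18%, a little under one in five.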

Most contamination in assembled sequences consisted of only a few hundred DNA bases at a time, although in a minority of cases stretches of more than a thousand human bases were seen in assembled non-primate sequences.

It is not surprising to find contamination in raw sequencing reads, says Robert Waterston, a genomicist at the University of Washington in Seattle. These are "unvarnished raw data", he says, and "some contamination seems inevitable". However, "the presence of human sequences in assemblies is another matter", he adds.

The computer algorithms that assemble sequences ought to spot contamination and remove artefacts, says Haussler. The latest findings "represent a failure of that filter".

Longo and his team speculate that contaminating sequences could come either from skin and hair cells from the people who handle samples, or from other DNA libraries that are kept at the same facilities. The team also recorded contamination from species other than humans, which lends credence to the latter possibility, says Rachel O'Neill, a cell biologist at the University of Connecticut and the study's lead author. The researchers found, for example, evidence that databases of platypus (Ornithorhynchus anatinus) DNA contain some sequences that probably originate from tammar wallabies (Macropus eugenii).

Spring cleaning

"It would be great if we could clean up" these artefacts, says Haussler. Using forensics-level precautions would be prohibitively expensive for most projects, but bioinformatics specialists need to improve their contamination filters before final assemblies are released, he says. Most sequences are routinely updated, so "I would hope that the next versions will have this contamination eliminated", he adds.

O'Neill says that the major concern surrounding contamination lies not so much in the errors that have been introduced so far, but, "looking forward, in translating the next wave of genomics into clinical practice".

It is often straightforward to spot human contamination in non-human genomes, but it will not be so easy to identify contamination of one human genome with another. As more labs and companies begin to sequence whole genomes of individual people for personalized medicine or to study how genetic differences affect disease, potential contamination "is going to be very difficult to track", says O'Neill.
