By Melissa Lee Phillips
Around a fifth of non-primate genome databases seem to be contaminated with human DNA sequences, according to a study.
Mark Longo, a geneticist at the University of Connecticut in Storrs, and his colleagues found that 18% of public databases of bacterial, plant and animal genome sequences contain stretches of human DNA, possibly as a result of researchers handling samples during sequencing. Their findings are published today in PLoS ONE [1].
David Haussler, a biomolecular engineer at the University of California, Santa Cruz, says that many genomics researchers are already aware of the presence of human DNA artefacts in genome assemblies, but that this contamination had not previously been quantified.
Forensic scientists and researchers working with ancient DNA take extreme measures to avoid contamination, but most sequencing projects aren't so stringent. "It would be hugely expensive," says Haussler.
Longo and his colleagues decided to investigate genome-database contamination after discovering human sequences in a project on the genome of the zebrafish (Danio rerio). They scanned non-primate genome databases for 'Alu' elements: short stretches of DNA, named after the Alu restriction enzyme used to characterize them, that are abundant in the human genome and are known to be specific to primates.
The researchers found human DNA sequences in 492 out of 2,749 sequencing archives that they checked. Contamination showed up in raw sequencing data and in final assemblies of data that had been pieced together by computers to compile a complete genome sequence.
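The screening idea can be illustrated with a short Python sketch. It is not the study's actual pipeline: the marker sequence is a made-up placeholder rather than a real Alu consensus, and the k-mer size, threshold, archive names and function names are all assumptions chosen for illustration. The sketch flags any archive in which at least one read shares most of the marker's k-mers, checking both strands, and then repeats the arithmetic for the reported counts.

```python
from typing import Dict, Iterable, Set

K = 8  # k-mer length; an arbitrary choice for this toy example

# Hypothetical placeholder marker. This is NOT a real Alu consensus sequence.
MARKER = "ACGTTGCAAGGCTTAACCGGTTAC"

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def kmers(seq: str, k: int = K) -> Set[str]:
    """Return the set of all length-k substrings of seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def marker_fraction(read: str, marker: str, k: int = K) -> float:
    """Fraction of the marker's k-mers present in the read on either strand."""
    marker_kmers = kmers(marker, k)
    if not marker_kmers:
        return 0.0
    read_kmers = kmers(read, k) | kmers(reverse_complement(read), k)
    return len(marker_kmers & read_kmers) / len(marker_kmers)


def flag_archives(
    archives: Dict[str, Iterable[str]],
    marker: str = MARKER,
    threshold: float = 0.8,  # assumed cut-off, not taken from the study
) -> Set[str]:
    """Return names of archives with at least one read matching the marker."""
    return {
        name
        for name, reads in archives.items()
        if any(marker_fraction(read, marker) >= threshold for read in reads)
    }


# Toy archives (hypothetical names and sequences).
toy_archives = {
    "fish_clean": ["TTTTTTTTTTTTTTTTTTTTTTTTTTTT"],
    "fish_contaminated": ["CCCC" + MARKER + "GGGG"],
    "fish_reverse_strand": [reverse_complement("AA" + MARKER + "TT")],
}

flagged = flag_archives(toy_archives)
print(sorted(flagged))

# The counts reported in the article: 492 of 2,749 archives.
print(f"{492 / 2749:.1%}")  # prints 17.9%
```

A real screen would rely on a sequence-alignment tool and a curated library of primate-specific repeats. Exact k-mer matching is used here only to keep the example self-contained.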
Most contamination in assembled sequences consisted of only a few hundred DNA bases at a time, although in a minority of cases stretches of more than a thousand human bases were seen in assembled non-primate sequences.
It is not surprising to find contamination in raw sequencing reads, says Robert Waterston, a genomicist at the University of Washington in Seattle. These are "unvarnished raw data", he says, and "some contamination seems inevitable". However, "the presence of human sequences in assemblies is another matter", he adds.
The computer algorithms that assemble sequences ought to spot contamination and remove artefacts, says Haussler. The latest findings "represent a failure of that filter".
Longo and his team speculate that contaminating sequences could come either from skin and hair cells from the people who handle samples, or from other DNA libraries that are kept at the same facilities. The team also recorded contamination from species other than humans, which lends credence to the latter possibility, says Rachel O'Neill, a cell biologist at the University of Connecticut and the study's lead author. The researchers found, for example, evidence that databases of platypus (Ornithorhynchus anatinus) DNA contain some sequences that probably originate from tammar wallabies (Macropus eugenii).
"It would be great if we could clean up" these artefacts, says Haussler. Using forensics-level precautions would be prohibitively expensive for most projects, but bioinformatics specialists need to improve their contamination filters before final assemblies are released, he says. Most sequences are routinely updated, so "I would hope that the next versions will have this contamination eliminated", he adds.
O'Neill says that the major concern surrounding contamination lies not so much in the errors that have been introduced so far, but, "looking forward, in translating the next wave of genomics into clinical practice".
It is often straightforward to spot human contamination in non-human genomes, but it will not be so easy to identify contamination of one human genome with another. As more labs and companies begin to sequence whole genomes of individual people for personalized medicine or to study how genetic differences affect disease, potential contamination "is going to be very difficult to track", says O'Neill.