Genome Databases Suffer from the Human Touch

Contamination of non-primate DNA archives with human sequences indicates that better screening is needed.


Nature














By Melissa Lee Phillips

Around a fifth of non-primate genome databases seem to be contaminated with human DNA sequences, according to a study.

Mark Longo, a geneticist at the University of Connecticut in Storrs, and his colleagues found that 18% of public databases of bacterial, plant and animal genome sequences contain stretches of human DNA, possibly as a result of researchers handling samples during sequencing. Their findings are published today in PLoS ONE [1].

David Haussler, a biomolecular engineer at the University of California, Santa Cruz, says that many genomics researchers are already aware of the presence of human DNA artefacts in genome assemblies, but this contamination has never been quantified.

Forensic scientists and researchers working with ancient DNA take extreme measures to avoid contamination, but most sequencing projects aren't so stringent. "It would be hugely expensive," says Haussler.

Filter failure

Longo and his colleagues decided to investigate genome-database contamination after discovering human sequences in a project on the genome of the zebrafish (Danio rerio). They scanned non-primate genome databases for 'Alu' elements -- short, repetitive stretches of DNA named after the Alu restriction enzyme that cuts them -- which are abundant in the human genome and are known to be specific to primates.
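The article does not describe how the scan was implemented, so the following is only a minimal sketch of the general idea: flag any sequence that shares a large fraction of its short subsequences (k-mers) with a marker known to be specific to the suspected contaminant. The marker string, the read names, the k-mer length and the threshold are all illustrative assumptions; the marker is a made-up placeholder, not a real Alu consensus sequence, and a real screen would more likely rely on an alignment tool than on exact matching.

```python
"""Toy contamination screen based on exact k-mer matching (illustrative only)."""


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T, N)."""
    table = str.maketrans("ACGTN", "TGCAN")
    return seq.upper().translate(table)[::-1]


def kmers(seq, k):
    """Return the set of all length-k substrings of seq, uppercased."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_marker_index(marker, k):
    """Index the marker on both strands, since a read may come from either."""
    return kmers(marker, k) | kmers(reverse_complement(marker), k)


def shared_fraction(read, marker_index, k):
    """Fraction of the read's distinct k-mers that also occur in the marker."""
    read_kmers = kmers(read, k)
    if not read_kmers:
        return 0.0  # read shorter than k: nothing to compare
    return len(read_kmers & marker_index) / len(read_kmers)


def flag_reads(reads, marker, k=8, threshold=0.5):
    """Return names of reads whose shared k-mer fraction reaches the threshold."""
    index = build_marker_index(marker, k)
    return [
        name
        for name, seq in reads.items()
        if shared_fraction(seq, index, k) >= threshold
    ]


# Hypothetical placeholder marker; NOT a real Alu sequence.
PLACEHOLDER_MARKER = "ACGTTGCAAGGCTTAACCGGTATCGA"

# Hypothetical reads: one unrelated, two carrying the marker (one per strand).
EXAMPLE_READS = {
    "clean_read": "TTTTTTTTTTGGGGGGGGGGCCCCCCCCCC",
    "read_with_marker": "AAAA" + PLACEHOLDER_MARKER + "AAAA",
    "read_with_marker_rc": "CCCC" + reverse_complement(PLACEHOLDER_MARKER) + "CCCC",
}

if __name__ == "__main__":
    for name in flag_reads(EXAMPLE_READS, PLACEHOLDER_MARKER):
        print("possible contamination:", name)
```

Exact matching keeps the sketch short but would miss contaminant sequences that carry mutations or sequencing errors, which is one reason a production filter would probably use a more tolerant comparison.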

The researchers found human DNA sequences in 492 of the 2,749 sequencing archives that they checked, or about 18%. Contamination showed up both in raw sequencing data and in final assemblies, in which computers had pieced the data together to compile a complete genome sequence.

Most contamination in assembled sequences consisted of only a few hundred DNA bases at a time, although in a minority of cases stretches of more than a thousand human bases were seen in assembled non-primate sequences.

It is not surprising to find contamination in raw sequencing reads, says Robert Waterston, a genomicist at the University of Washington in Seattle. These are "unvarnished raw data", he says, and "some contamination seems inevitable". However, "the presence of human sequences in assemblies is another matter", he adds.

The computer algorithms that assemble sequences ought to spot contamination and remove artefacts, says Haussler. The latest findings "represent a failure of that filter".

Longo and his team speculate that contaminating sequences could come either from skin and hair cells from the people who handle samples, or from other DNA libraries that are kept at the same facilities. The team also recorded contamination from species other than humans, which lends credence to the latter possibility, says Rachel O'Neill, a cell biologist at the University of Connecticut and the study's lead author. The researchers found, for example, evidence that databases of platypus (Ornithorhynchus anatinus) DNA contain some sequences that probably originate from tammar wallabies (Macropus eugenii).

Spring cleaning

"It would be great if we could clean up" these artefacts, says Haussler. Using forensics-level precautions would be prohibitively expensive for most projects, but bioinformatics specialists need to improve their contamination filters before final assemblies are released, he says. Most sequences are routinely updated, so "I would hope that the next versions will have this contamination eliminated", he adds.

O'Neill says that the major concern surrounding contamination lies not so much in the errors that have been introduced so far, but, "looking forward, in translating the next wave of genomics into clinical practice".

It is often straightforward to spot human contamination in non-human genomes, but it will not be so easy to identify contamination of one human genome with another. As more labs and companies begin to sequence whole genomes of individual people for personalized medicine or to study how genetic differences affect disease, potential contamination "is going to be very difficult to track", says O'Neill.



3 Comments

  1. jtdwyer 10:48 PM 2/16/11

    Using stringent measures to ensure that samples are not contaminated may be expensive, but what value is there in corrupted data? Even filtering out contaminated sequences would likely eliminate valid sequences and would not likely reinsert omitted sequences.

    Corrupted or expunged data has seriously limited value since any conclusions derived from it must be suspect. They're wasting expensive resources on obtaining misleading data: if they can't pay for correct data then just save all this wasted and potentially damaging effort!

  2. gmacdonnell 11:16 AM 2/17/11

    We have suggested for a while that a "confidence" field be added to GenBank and other databases. This field would reflect the protocol used as well as the skill level of the person or people submitting the sequence.

  3. whatsup 08:58 PM 2/17/11

    Wonder if human DNA got mixed in with Neanderthals via "sleeping around" or sloppy lab work?

    whatsup
