Linking a human genome in an anonymous sequencing database to its real-world counterpart wasn’t supposed to be possible.
Yaniv Erlich, a geneticist at the Massachusetts Institute of Technology’s Whitehead Institute for Biomedical Research, apparently never got the memo. In the end, all that he and M.I.T. undergraduate student Melissa Gymrek needed to uncover the identities of 50 individuals whose DNA is available online in free-access databases was a computer and an Internet connection.
Erlich and Gymrek selected 32 male genomes from the 1000 Genomes Project, which has a publicly accessible database designed to help researchers find genes associated with different human diseases. Next, Erlich and Gymrek used an algorithm to extract genetic markers from the DNA sequences. The algorithm is specially designed to home in on short tandem repeats on a man’s Y chromosome, known as Y-STRs. Y-STRs are passed patrilineally with little to no change from one generation to the next. Because surnames are also typically passed from father to son, Y-STRs provide a way to link an anonymous genome to a particular family surname.
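To make the idea of a short tandem repeat concrete, here is a minimal sketch in Python that counts how many times a short motif repeats back to back in a DNA string. It is an illustration only, not the algorithm the researchers used: the function name `longest_tandem_repeat`, the motif and the toy sequence are all assumptions made up for this example, and real tools work on raw sequencing reads rather than a clean string.

```python
import re


def longest_tandem_repeat(sequence: str, motif: str) -> int:
    """Return the largest number of back-to-back copies of `motif` in `sequence`.

    Hypothetical helper for illustration; not the researchers' method.
    """
    # Match one or more consecutive copies of the motif.
    pattern = re.compile(f"(?:{re.escape(motif)})+")
    # Convert each matched run from a length in bases to a copy count.
    runs = [len(m.group(0)) // len(motif) for m in pattern.finditer(sequence)]
    return max(runs, default=0)


# Toy sequence (made up): five copies of GATA, a spacer, then two more copies.
toy_sequence = "TTAC" + "GATA" * 5 + "CCGT" + "GATA" * 2
repeat_count = longest_tandem_repeat(toy_sequence, "GATA")
print(repeat_count)
```

The repeat count at each of a set of such markers, taken together, forms the kind of profile that can be compared between men.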
Using metadata about the anonymous genomes included in the database, the researchers narrowed the field of possible DNA matches to 10,000 men of a particular age who resided in Utah when they donated their DNA. Erlich and Gymrek then plugged the genomes into two of the Web’s most popular genealogy sites, Ysearch and SMGF. These recreational sites provide free access to databases that connect Y-STR markers to surnames. The researchers found that eight of their samples strongly matched the surnames of Mormon families in Utah. Erlich and Gymrek’s findings were published in the January 17 Science.
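The surname lookup described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the interface of Ysearch or SMGF: the marker names (`M1` to `M4`), the surnames, the repeat counts and the `max_distance` threshold are all hypothetical, and the sketch simply ranks surnames by how many database records sit within a small total difference in repeat counts from the query profile.

```python
from collections import Counter

# Hypothetical genealogy records: (surname, {marker name: repeat count}).
# All names and numbers here are invented for illustration.
RECORDS = [
    ("Alder", {"M1": 14, "M2": 13, "M3": 29, "M4": 24}),
    ("Alder", {"M1": 14, "M2": 13, "M3": 29, "M4": 25}),
    ("Birch", {"M1": 13, "M2": 12, "M3": 30, "M4": 22}),
]


def marker_distance(a, b):
    """Sum of absolute repeat-count differences over markers both profiles share."""
    shared = a.keys() & b.keys()
    if not shared:
        return None  # nothing to compare
    return sum(abs(a[m] - b[m]) for m in shared)


def candidate_surnames(profile, records, max_distance=1):
    """Count, per surname, the records within `max_distance` of the query profile."""
    hits = Counter()
    for surname, markers in records:
        distance = marker_distance(profile, markers)
        if distance is not None and distance <= max_distance:
            hits[surname] += 1
    # Surnames with the most close records come first.
    return hits.most_common()


anonymous_profile = {"M1": 14, "M2": 13, "M3": 29, "M4": 24}
candidates = candidate_surnames(anonymous_profile, RECORDS)
print(candidates)
```

A candidate surname on its own does not identify anyone. As the article describes, it becomes identifying when combined with metadata such as age and place of residence.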
The results show that a curious party equipped with open-access information can not only tie a three-billion-digit-long genome directly to an individual but can also use bits and pieces of that same DNA to identify distant relatives, male or female, of the original genetic donor. “If your fourth cousin participated in this database, we could use it to find out about your ancestry,” Erlich says.
Although privacy concerns about publicly accessible genome data have cropped up in the past with genealogy databases, this is the first time that anyone has connected an anonymous DNA sequence to its donor without donor DNA as a reference.
Genome mining could have serious consequences for DNA donors. Under federal law, health insurance companies cannot use genetic data, but there is currently nothing barring companies from using a person’s genome to define life insurance policies or determine long-term disability care. The new research prompted the National Institutes of Health (NIH) to remove people’s ages from federally funded genetic databases such as the 1000 Genomes Project that allow open access to scientists.
Yet the NIH’s strategy may be missing the point, says Lawrence Gostin, a professor of medicine at Georgetown University and director of the World Health Organization’s Collaborating Center on Public Health Law and Human Rights. “This is not a long-term solution to the problem because in reality there is nothing more personally identifiable than your genome,” he says.
Although only talented geneticists would be able to hack a genome as Erlich did, such re-identification will become more likely as computing gets more sophisticated and more data become available.
Open-access genetic databases make big contributions to medical research, Erlich says. Only through studying diverse groups of individuals can scientists detect DNA variants that affect a person’s susceptibility to medical conditions like heart disease and diabetes. Erlich says identifying characteristics such as hair and eye color, facial features and age greatly contribute to how useful this data is. He says ensuring complete privacy means limiting the use of information that might be used to identify a subject.