This privacy standard may seem so high as to be unattainable — and indeed, there is no useful differentially private algorithm that gives out exactly the same information regardless of whether it includes the real you or the pretend you. But if we allow algorithms that give out almost exactly the same information in the two cases, then useful and efficient algorithms do exist. This “almost” is a precisely calibrated parameter, a measurable quantification of privacy. Individuals or social institutions could decide what value of this parameter represents an acceptable loss of privacy, and then differentially private algorithms could be chosen that guarantee that the privacy loss is less than the selected parameter.
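In the research literature, this parameter is conventionally written ε (the article leaves it unnamed). In the standard formalization, a randomized algorithm M is ε-differentially private if, for every pair of databases D and D′ that differ in a single person's record, and every set S of possible outputs,

\[
\Pr[M(D) \in S] \le e^{\varepsilon} \cdot \Pr[M(D') \in S].
\]

The smaller ε is, the more nearly indistinguishable the algorithm's behavior on the two databases, and the stronger the privacy guarantee.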
Privacy experts have developed a wide assortment of specialized differentially private algorithms to handle different kinds of data and questions about the data. Although much of this work is technical and difficult for nonexperts to penetrate, researchers are starting to build standardized computer languages that would allow nonexperts to release sensitive data in a differentially private way by writing a simple computer program.
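As one concrete illustration (the article does not single out any particular algorithm), among the simplest differentially private algorithms is the Laplace mechanism, which computes the true answer to a counting query and then perturbs it with random noise scaled to ε. A minimal Python sketch, with illustrative names:

```python
import numpy as np

def private_count(records, predicate, epsilon):
    """Answer "how many records satisfy predicate?" with epsilon-differential privacy.

    A counting query has sensitivity 1 -- adding or removing one person's
    record changes the true count by at most 1 -- so Laplace noise with
    scale 1/epsilon delivers the epsilon guarantee.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

# Toy usage: a noisy count from an invented three-record database.
db = [{"diabetes": True}, {"diabetes": False}, {"diabetes": True}]
print(private_count(db, lambda r: r["diabetes"], epsilon=0.5))
```

Smaller values of ε force larger noise, making each released answer less accurate; choosing ε is precisely the privacy-versus-usefulness trade-off the article describes.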
One real-world application already uses differential privacy: a Census Bureau project called OnTheMap, which gives researchers access to agency data. Also, differential privacy researchers have fielded preliminary inquiries from Facebook and the federally funded iDASH center at the University of California, San Diego, whose mandate in large part is to find ways for researchers to share biomedical data without compromising privacy.
“Differential privacy is a promising and exciting technology,” said Aaron Roth, a computer scientist at the University of Pennsylvania.
Needle in a Haystack
It might seem that a simpler solution to the privacy problem would be to release only “aggregate” information — statements about large groups of people. But even this approach is susceptible to breaches of privacy.
Suppose you wanted to ascertain whether this writer has diabetes and you knew I belonged to a health database. You could find this out simply by subtracting the answers to two aggregate-level questions: “How many people in the database have diabetes?” and “How many people in the database not named Erica Klarreich have diabetes?”
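A toy sketch of this differencing attack, with invented records, makes the arithmetic concrete:

```python
# Invented toy database; the records are made up for illustration.
db = [
    {"name": "Erica Klarreich", "diabetes": True},
    {"name": "Alice", "diabetes": False},
    {"name": "Bob", "diabetes": True},
]

def diabetics(records, exclude_name=None):
    """Aggregate query: how many people (optionally excluding one name) have diabetes?"""
    return sum(1 for r in records
               if r["diabetes"] and r["name"] != exclude_name)

# Two innocuous-looking aggregate answers...
everyone = diabetics(db)
everyone_but_erica = diabetics(db, exclude_name="Erica Klarreich")

# ...subtract to a fact about one individual: 1 means she has diabetes, 0 means she does not.
print(everyone - everyone_but_erica)  # -> 1
```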
Clearly, these two questions, when combined, violate my privacy. But it's not always easy to spot which combinations of questions would breach privacy. Detecting such combinations is, in its full generality, what computer scientists call an "NP-hard" problem, meaning there is probably no efficient computer algorithm that can catch all such attacks.
And when the attacker has access to outside information about individuals in the database, extracting private information from aggregate statistics becomes even easier.
In 2008, a research team demonstrated the dangers of releasing aggregate information from genome-wide association studies, one of the primary research vehicles for uncovering links between diseases and particular genes. These studies typically involve sequencing the genomes of a test group of 100 to 1,000 patients who have the same disease and then calculating, for each of roughly 100,000 different mutations, how frequently it appears in the group. If a mutation appears in the group far more frequently than in the general population, that mutation is flagged as a possible cause of, or contributor to, the disease.
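A schematic version of that frequency comparison, with randomly generated numbers standing in for real sequencing data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented stand-in data: rows are patients in the disease group, columns are
# mutations; a True entry means that patient carries that mutation.
n_patients, n_mutations = 100, 100_000
population_freq = rng.uniform(0.01, 0.5, size=n_mutations)  # reference frequencies
genomes = rng.random((n_patients, n_mutations)) < population_freq

# Average frequency of each mutation within the test group.
group_freq = genomes.mean(axis=0)

# Flag mutations notably more common in the group than in the general
# population (the 0.05 threshold is arbitrary, purely for illustration).
flagged = np.where(group_freq > population_freq + 0.05)[0]
print(f"{len(flagged)} mutations flagged as possible contributors to the disease")
```

Here the genomes are sampled at the population frequencies themselves, so any flagged mutations are pure sampling noise; in a real study group, disease-linked mutations would stand out far more strongly.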