The research team, led by Nils Homer, then a graduate student at the University of California at Los Angeles, showed that in many cases, if you know a person’s genome, you can figure out beyond a reasonable doubt whether that person has participated in a particular genome-wide test group. After Homer’s paper appeared, the National Institutes of Health reversed a policy, instituted earlier that year, that had required aggregate data from all NIH-funded genome-wide association studies to be posted publicly.
Perhaps even more surprisingly, researchers showed in 2011 that it is possible to glean personal information about purchases from Amazon.com’s product recommendation system, which makes aggregate-level statements of the form, “Customers who bought this item also bought A, B and C.” By observing how the recommendations changed over time and cross-referencing them with customers’ public reviews of purchased items, the researchers were able in several cases to infer that a particular customer had bought a particular item on a particular day — even before the customer had posted a review of the item.
In all these cases, the privacy measures that had been taken seemed adequate, until they were breached. But even as the list of privacy failures ballooned, a different approach to data release was in the making, one that came with an a priori privacy guarantee. To achieve this goal, researchers had gone back to basics: Just what does it mean, they asked, to protect privacy?
If researchers study a health database and discover a link between smoking and some form of cancer, differential privacy will not protect a public smoker from being labeled with elevated cancer risk. But if a person’s smoking is a secret hidden in the database, differential privacy will protect that secret.
“‘Differential’ refers to the difference between two worlds — one in which you allow your sensitive data to be included in the database and one in which you don’t,” McSherry said. The two worlds cannot be made to work out exactly the same, but they can be made close enough that they are effectively indistinguishable. That, he said, is the goal of differential privacy.
Differential privacy focuses on information-releasing algorithms, which take in questions about a database and spit out answers — not exact answers, but answers that have been randomly altered in a prescribed way. When the same question is asked of a pair of databases (A and B) that differ only with regard to a single individual (Person X), the algorithm should spit out essentially the same answers.
More precisely, given any answer that the algorithm could conceivably spit out, the probability of getting that answer should be almost exactly the same for both databases; that is, the ratio of these two probabilities should be bounded by some number R close to 1. The closer R is to 1, the more difficult it will be for an attacker to figure out whether he is getting information about database A or database B and the better protected Person X will be. After all, if the attacker can’t even figure out whether the information he is getting includes Person X’s data, he certainly can’t figure out what Person X’s data is.
(Differential privacy researchers usually prefer to speak in terms of the logarithm of R, which they denote ε. This parameter puts a number on how much privacy leaks out when the algorithm is carried out: The closer ε is to 0, the better the algorithm is at protecting privacy.)
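The ratio bound can be checked concretely for the most common noise distribution used in differential privacy, the Laplace distribution. The sketch below (an illustration, not drawn from the article; the function names are my own) compares the output densities of a noisy counting query over two neighboring databases whose true counts differ by one, and confirms that the ratio never strays outside [1/R, R] with R = e^ε:

```python
import math

def laplace_density(x, center, scale):
    """Density of the Laplace distribution centered at `center`."""
    return math.exp(-abs(x - center) / scale) / (2 * scale)

epsilon = 0.5        # privacy parameter: ε = log R
scale = 1 / epsilon  # noise scale for a counting query (sensitivity 1)

# Databases A and B differ in a single person, so their true counts
# for any counting query differ by at most 1.
count_a, count_b = 100, 101

# For every possible noisy answer x, the ratio of the probability of
# seeing x from database A to seeing it from database B is bounded.
R = math.exp(epsilon)
for x in [90, 99.5, 100, 100.5, 101, 110]:
    ratio = laplace_density(x, count_a, scale) / laplace_density(x, count_b, scale)
    assert 1 / R <= ratio <= R
```

The bound holds because the Laplace density falls off exponentially: shifting its center by 1 changes the density at any point by a factor of at most e^(1/scale) = e^ε.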
To get a sense of how differentially private algorithms can be constructed, let’s look at one of the simplest such algorithms. It focuses on a scenario in which a questioner is limited to “counting queries”; for example: “How many people in the database have property P?”
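A standard way to make such a counting query differentially private is to compute the true count and then add random noise drawn from a Laplace distribution with scale 1/ε. The sketch below is a minimal illustration of that idea (the names `private_count` and `has_property` are my own, and a real system would also need to track the privacy budget spent across repeated queries):

```python
import random

def private_count(database, has_property, epsilon):
    """Answer "How many people in the database have property P?"
    with Laplace noise of scale 1/epsilon added to the true count.
    """
    true_count = sum(1 for person in database if has_property(person))
    # A Laplace(0, 1/epsilon) sample can be generated as the difference
    # of two independent Exponential(epsilon) samples.
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise

# Illustrative use: each query returns the true count plus noise, so
# individual answers are perturbed but remain accurate on average.
database = [{"smoker": True}] * 30 + [{"smoker": False}] * 70
answer = private_count(database, lambda person: person["smoker"], epsilon=0.5)
```

The smaller ε is, the wider the noise distribution, so the questioner gets less accurate answers in exchange for stronger privacy for each individual in the database.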