With a differentially private algorithm, there’s no need to analyze a question carefully to determine whether it seeks to invade an individual’s privacy; that protection is automatically built into the algorithm’s functioning. Because prying questions usually boil down to small numbers related to specific people and non-prying questions examine aggregate-level behavior of large groups, the same amount of added noise that renders answers about individuals meaningless will have only a minor effect on answers to many legitimate research questions.
With differential privacy, the kinds of issues that plagued other data releases — such as attackers cross-referencing data with outside information — disappear. The approach’s mathematical privacy guarantees do not depend on the attacker having limited outside information or resources.
“Differential privacy assumes that the adversary is all-powerful,” McSherry said. “Even if attackers were to come back 100 years later, with 100 years’ worth of thought and information and computer technology, they still would not be able to figure out whether you are in the database. Differential privacy is future-proofed.”
A Fundamental Primitive
So far, we have focused on a situation in which someone asks a single counting query about a single database. But the real world is considerably more complex.
Researchers typically want to ask many questions about a database. And over your lifetime, snippets of your personal information will probably find their way into many different databases, each of which may be releasing data without consulting the others.
Differential privacy provides a precise and simple way to quantify the cumulative privacy hit you sustain if researchers ask multiple questions about the databases to which you belong. If you have sensitive data in two datasets, for example, and the curators of the two datasets release those data using algorithms whose privacy parameters are Ɛ1 and Ɛ2, respectively, then the total amount of your privacy that has leaked out is at most Ɛ1+ Ɛ2. The same additive relationship holds if a curator allows multiple questions about a single database. If researchers ask m questions about a database and each question gets answered with privacy parameter Ɛ, the total amount of privacy lost is at most mƐ.
So, in theory, the curator of a dataset could allow researchers to ask as many counting queries as he wishes, as long as he adds enough Laplace noise to each answer to ensure that the total amount of privacy that leaks out is less than his preselected privacy “budget.”
And although we have limited our attention to counting queries, it turns out that this restriction is not very significant. Many of the other question types that researchers like to ask can be recast in terms ofcounting queries. If you wanted to generate a list of the top 100 baby names for 2012, for example, you could ask a series of questions of the form, “How many babies were given names that start with A?” (or Aa, Ab or Ac), and work your way through the possibilities.
“One of the early results in machine learning is that almost everything that is possible in principle to learn can be learned through counting queries,” Roth said. “Counting queries are not isolated toy problems, but a fundamental primitive” — that is, a building block from which many more complex algorithms can be built.
But there’s a catch. The more questions we want to allow, the less privacy each question is allowed to use up from the privacy budget and the more noise has to be added to each answer. Consider the baby names question. If we decide on a total privacy budget Ɛ of 0.01 and there are 10,000 names to ask about, each question’s individual privacy budget is only Ɛ/10,000, or 0.000001. The expected amount of noise added to each answer will be 10,000/Ɛ, or 1,000,000 — an amount that will swamp the true answers.