In other words, the naive approach of adding Laplace noise to each question independently is limited in the number of questions it can usefully answer. To deal with this, computer scientists have developed an arsenal of more powerful primitives — algorithmic building blocks that, by taking into account the particular structure of a database and problem type, can answer more questions, more accurately, than the naive approach can.
For example, in 2005, Smith noticed that the baby names problem has a special structure: removing one person’s personal information from the database changes the answer for only one of the 10,000 names. Because of this property, we can get away with adding Laplace noise scaled to only 1/Ɛ to each name’s count, instead of 10,000/Ɛ, and the result will still stay within our Ɛ privacy budget. This algorithm is a primitive that can be applied to any “histogram” query — that is, one asking how many people fall into each of several mutually exclusive categories, such as first names.
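To see the savings concretely, here is a minimal sketch in Python (the function name and the toy counts are illustrative, not from Smith’s work): because removing a person affects at most one bin, Laplace noise of scale 1/Ɛ on every count suffices.

```python
import numpy as np

def private_histogram(counts, epsilon):
    """Release histogram counts under Ɛ-differential privacy.

    Removing one person changes exactly one count, by at most 1,
    so the whole histogram has sensitivity 1, and Laplace noise of
    scale 1/epsilon on every bin stays within the Ɛ budget.
    """
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon, size=len(counts))
    return [count + n for count, n in zip(counts, noise)]

# Toy data: counts for four baby names (illustrative only).
print(private_histogram([120, 87, 3, 410], epsilon=0.5))
```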
When Smith told Dwork about this insight in the early days of differential privacy research, “something inside me went, ‘Wow!’” Dwork said. “I realized that we could exploit the structure of a query or computation to get much greater accuracy than I had realized.”
Since that time, computer scientists have developed a large library of such primitives. And because the additive rule explains what happens to the privacy parameter when algorithms are combined, computer scientists can assemble these building blocks into complex structures while keeping tabs on just how much privacy the resulting algorithms use up.
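The additive rule itself takes only a few lines to express: if one building block spends Ɛ₁ and another spends Ɛ₂ on the same data, the combination spends at most Ɛ₁ + Ɛ₂. A hypothetical bookkeeping sketch (the class and method names are invented for illustration):

```python
class PrivacyAccountant:
    """Track total privacy loss under the additive rule: the Ɛ of a
    combined algorithm is at most the sum of its components' Ɛ values."""

    def __init__(self):
        self.spent = 0.0

    def charge(self, epsilon):
        self.spent += epsilon

accountant = PrivacyAccountant()
accountant.charge(0.5)   # e.g., a histogram release at Ɛ = 0.5
accountant.charge(0.25)  # e.g., a follow-up count at Ɛ = 0.25
print(accountant.spent)  # 0.75, the combined privacy cost
```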
“One of the achievements in this area has been to come up with algorithms that can handle a very large number of queries with a relatively small amount of noise,” said Moritz Hardt of IBM Research Almaden in San Jose, Calif.
To make differential privacy more accessible to nonexperts, several groups are working to create a differential privacy programming language that would hide all the underlying mathematics of the algorithmic primitives behind a layer the user doesn’t have to think about.
“If you’re the curator of a dataset, you don’t have to worry about what people are doing with your dataset as long as they are running queries written in this language,” said McSherry, who has created one such preliminary language, called PINQ. “The program serves as a proof that the query is OK.”
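PINQ itself is built on C#’s LINQ, and the sketch below is not its actual interface; it is just a Python illustration, under simple assumptions, of the guarantee McSherry describes: every query declares its Ɛ cost, and the wrapper refuses any query that would overrun the curator’s budget.

```python
import numpy as np

class PrivateDataset:
    """Hypothetical curator-side wrapper in the spirit of PINQ:
    a query runs only if its declared Ɛ cost fits the remaining budget."""

    def __init__(self, data, budget):
        self._data = data
        self.remaining = budget

    def run(self, query, epsilon):
        if epsilon > self.remaining:
            raise RuntimeError("privacy budget exhausted")
        self.remaining -= epsilon
        return query(self._data, epsilon)  # the query adds its own noise

def noisy_count(data, epsilon):
    # A count has sensitivity 1, so Laplace noise of scale 1/Ɛ suffices.
    return len(data) + np.random.laplace(scale=1.0 / epsilon)

ds = PrivateDataset(data=list(range(1000)), budget=1.0)
print(ds.run(noisy_count, epsilon=0.4))  # runs; Ɛ = 0.6 remains
```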
A Nonrenewable Resource
Because the simple additive Ɛ rule gives a precise upper limit on how much total privacy you lose when the various databases you belong to release information in a differentially private way, it turns privacy into a “fungible currency,” McSherry said.
For example, if you were to decide how much total lifetime privacy loss would be acceptable to you, you could then decide how you want to “spend” it — whether in exchange for money, perhaps, or to support a research project you admire. Each time you allowed your data to be used in a differentially private data release, you would know exactly how much of your privacy budget remained.
Likewise, the curator of a dataset of sensitive information could decide how to spend whatever amount of privacy she had decided to release — perhaps by inviting proposals for research projects that would describe not only what questions the researchers wanted to ask and why, but also how much privacy the project would use up. The curator could then decide which projects would make the most worthwhile use of the dataset’s predetermined privacy budget. Once this budget had been used up, the dataset could be closed to further study.
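Under the assumptions in this scenario (a fixed dataset-wide budget and proposals that declare their Ɛ cost up front), the curator’s accounting could be as simple as this hypothetical sketch:

```python
def allocate(budget, proposals):
    """Approve proposals in priority order until the dataset's fixed
    privacy budget runs out; the dataset is then closed to further study.
    Each proposal is a (name, epsilon_cost) pair -- illustrative only."""
    approved = []
    for name, cost in proposals:
        if cost <= budget:
            budget -= cost
            approved.append(name)
    return approved, budget

proposals = [("public-health study", 0.5), ("census audit", 0.4), ("marketing analysis", 0.3)]
print(allocate(1.0, proposals))  # approves the first two; ~0.1 of the budget remains
```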