With a differentially private algorithm, there’s no need to analyze a question carefully to determine whether it seeks to invade an individual’s privacy; that protection is automatically built into the algorithm’s functioning. Because prying questions usually boil down to small numbers related to specific people and non-prying questions examine aggregate-level behavior of large groups, the same amount of added noise that renders answers about individuals meaningless will have only a minor effect on answers to many legitimate research questions.
With differential privacy, the kinds of issues that plagued other data releases — such as attackers cross-referencing data with outside information — disappear. The approach’s mathematical privacy guarantees do not depend on the attacker having limited outside information or resources.
“Differential privacy assumes that the adversary is all-powerful,” McSherry said. “Even if attackers were to come back 100 years later, with 100 years’ worth of thought and information and computer technology, they still would not be able to figure out whether you are in the database. Differential privacy is future-proofed.”
A Fundamental Primitive
So far, we have focused on a situation in which someone asks a single counting query about a single database. But the real world is considerably more complex.
Researchers typically want to ask many questions about a database. And over your lifetime, snippets of your personal information will probably find their way into many different databases, each of which may be releasing data without consulting the others.
Differential privacy provides a precise and simple way to quantify the cumulative privacy hit you sustain if researchers ask multiple questions about the databases to which you belong. If you have sensitive data in two datasets, for example, and the curators of the two datasets release those data using algorithms whose privacy parameters are Ɛ1 and Ɛ2, respectively, then the total amount of your privacy that has leaked out is at most Ɛ1 + Ɛ2. The same additive relationship holds if a curator allows multiple questions about a single database. If researchers ask m questions about a database and each question gets answered with privacy parameter Ɛ, the total amount of privacy lost is at most mƐ.
So, in theory, the curator of a dataset could allow researchers to ask as many counting queries as he wishes, as long as he adds enough Laplace noise to each answer to ensure that the total amount of privacy that leaks out is less than his preselected privacy “budget.”
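The budget idea can be made concrete with a minimal sketch. The class and method names below (`BudgetedCounter`, `count`) are hypothetical, not from any particular library. The sketch assumes each counting query changes by at most 1 when one person is added or removed, so Laplace noise of scale 1/Ɛ is used, and it applies the additive rule that the privacy spent is the sum of the per-query Ɛ values.

```python
import random


class BudgetedCounter:
    """Hypothetical curator that answers counting queries under a total privacy budget."""

    def __init__(self, records, total_budget, rng=None):
        self.records = list(records)
        self.remaining = total_budget          # privacy budget not yet spent
        self.rng = rng or random.Random()

    def _laplace(self, scale):
        # The difference of two independent exponential draws, scaled,
        # follows a Laplace distribution with the given scale.
        return scale * (self.rng.expovariate(1.0) - self.rng.expovariate(1.0))

    def count(self, predicate, epsilon):
        """Return a noisy count of records matching `predicate`, spending `epsilon`."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self.remaining:
            raise RuntimeError("privacy budget exhausted")
        # Additive composition: each answered query subtracts its epsilon.
        self.remaining -= epsilon
        true_count = sum(1 for r in self.records if predicate(r))
        # Assumed noise scale for a counting query: 1 / epsilon.
        return true_count + self._laplace(1.0 / epsilon)


curator = BudgetedCounter(range(100), total_budget=1.0)
print(curator.count(lambda r: r % 2 == 0, epsilon=0.5))  # noisy value near 50
print(curator.remaining)                                  # half the budget is left
```

Once `remaining` reaches zero, the sketch refuses further queries; a real curator would need a policy for what happens at that point.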
And although we have limited our attention to counting queries, it turns out that this restriction is not very significant. Many of the other question types that researchers like to ask can be recast in terms of counting queries. If you wanted to generate a list of the top 100 baby names for 2012, for example, you could ask a series of questions of the form, “How many babies were given names that start with A?” (or Aa, Ab or Ac), and work your way through the possibilities.
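One way to "work your way through the possibilities" is sketched below. This is an illustration under stated assumptions, not the method any particular researcher uses: the function name `top_names` is hypothetical, names are assumed to be lowercase ASCII, and the counts are exact rather than noisy. The search always expands the prefix with the largest count, which is safe because a prefix's count is at least as large as the count of any single name beneath it.

```python
import heapq
import string


def top_names(names, k, alphabet=string.ascii_lowercase):
    """Find the k most common names using only counting queries (exact counts)."""

    def count_prefix(prefix):
        # Counting query: how many babies have a name starting with `prefix`?
        return sum(1 for n in names if n.startswith(prefix))

    def count_exact(name):
        # Counting query: how many babies have exactly this name?
        return sum(1 for n in names if n == name)

    # Max-heap via negated counts; entries are (-count, text, is_full_name).
    heap = [(-count_prefix(""), "", False)]
    result = []
    while heap and len(result) < k:
        neg_count, text, is_full_name = heapq.heappop(heap)
        if is_full_name:
            # No remaining prefix has a larger count, so this name is next.
            result.append((text, -neg_count))
            continue
        exact = count_exact(text)
        if exact:
            heapq.heappush(heap, (-exact, text, True))
        for letter in alphabet:
            child_count = count_prefix(text + letter)
            if child_count:
                heapq.heappush(heap, (-child_count, text + letter, False))
    return result


babies = ["emma"] * 3 + ["emily"] * 2 + ["noah"] * 4 + ["ava"]
print(top_names(babies, 2))  # [('noah', 4), ('emma', 3)]
```

In a differentially private setting each of these counts would carry added noise and consume part of the privacy budget, so the number of queries the search issues matters.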
“One of the early results in machine learning is that almost everything that is possible in principle to learn can be learned through counting queries,” Roth said. “Counting queries are not isolated toy problems, but a fundamental primitive” — that is, a building block from which many more complex algorithms can be built.
But there’s a catch. The more questions we want to allow, the less privacy each question is allowed to use up from the privacy budget and the more noise has to be added to each answer. Consider the baby names question. If we decide on a total privacy budget Ɛ of 0.01 and there are 10,000 names to ask about, each question’s individual privacy budget is only Ɛ/10,000, or 0.000001. The expected amount of noise added to each answer will be 10,000/Ɛ, or 1,000,000 — an amount that will swamp the true answers.
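The arithmetic in the baby names example can be checked with a few lines. The helper name `per_query_noise` is hypothetical, and the sketch assumes the budget is split evenly across queries and that each answer gets Laplace noise whose scale (and expected magnitude) is 1 divided by the per-query Ɛ.

```python
def per_query_noise(total_epsilon, num_queries):
    """Split a privacy budget evenly and return (per-query epsilon, noise scale)."""
    if total_epsilon <= 0 or num_queries <= 0:
        raise ValueError("budget and query count must be positive")
    per_query_epsilon = total_epsilon / num_queries
    # Assumed: Laplace noise with scale 1 / epsilon per counting query.
    noise_scale = 1.0 / per_query_epsilon
    return per_query_epsilon, noise_scale


eps_each, scale = per_query_noise(0.01, 10_000)
print(eps_each, scale)
```

Up to floating-point rounding, the two values should match the figures above: 0.000001 per question and a noise level of 1,000,000.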
16 Comments
We all have a great deal to gain by allowing our medical data to be shared - perhaps we should also put a price on openness.
A real-life example of adding random errors to data is from the early GPS network days. The satellites mis-reported their position slightly, so that users on the ground ended up with slightly incorrect locations. Military users, however, got a code to decrypt the position error sent by the satellites, and were able to determine their location more exactly. This allowed the general public to use GPS, but in wartime prevented the enemy from having as good an accuracy as your own military.
The random errors were changed frequently to prevent opponents from decrypting the corrections in enough time to be of use.
This is all well and good. Most odd is the amount of attention paid to being private.
Has anyone asked just what is so sacred about medical conditions?
It is as though some overwhelming fear of exposure has taken over.
Privacy, overdone, can lead to suspicion, often a heavier burden than exposure.
Do any of us want to be exposed to vagaries of chance as a consequence of 'privacy'?
Do any recent events ring a bell?
Privacy must have limits with regard to how it affects the rest of us.
@kienhua68: I do agree with you. Apart from the fear of a higher insurance premium if propensity to disease is revealed (which anyway can be legislated out by requirements on the insurance companies), what exactly are people trying to keep private? This is a genuine question.
The whole concept of privacy will soon become obsolete. As more and more data sources come on-line, most of them with no regard for privacy concerns, the private will cease to be... People are already on the verge of live streaming their whole lives, and not only their lives but also the fractions of life which they share with every other person. With the current development of big data tools, you will be able to know everything about anybody, using only the indiscretions of everybody else around them. The problem with privacy is asymmetry: it's me knowing something about you, while you know nothing about me. When everybody knows everything about everybody else, privacy will not only have disappeared but will have become irrelevant.
I recently completed a Uni group project addressing these systems.
Medical records can include unflattering or worse images of yourself.
This data along with other data can be extremely revealing, potentially including biometrics.
I want the anonymous data to be fully accessible to doctors and researchers. I also see great value in other data relating to consumption (not the TB kind) being made available to all through independent government agencies, as businesses always run better with the lights on.
A trustable system is capable of bringing us not just "real democracy", but "real time democracy", impervious to undemocratic influences.
The more identifying information that can be kept from online networks, the better. Only now are businesses waking up to the lack of security within their online databases, as some small businesses have their data encrypted by hackers who then extort them for access. I have never trusted an online computer with my data. Keeping important stuff offline is a good insurance policy.
People have private phone numbers, emails, social networks and so on. These practices can save a lot of hassle. If some people want to let it all hang out, that's fine, but if you find you have an audience, I think you will appreciate some curtains.
If the existence of Stuxnet isn't enough to concern you regarding the security of online computers, check out the research going on which is opening up the possibility of spying on offline machines through detection of electric fields generated by keyboards, mice, displays and their cables. Of course, wireless is a pushover to those in the know.
Some see the problems we are trying to resolve as trivial. I see an unhackable identity system as being the most important piece of infrastructure in our future society. The other main problem we must confront is ensuring that forced access for law enforcement is always limited by an independent judiciary that adheres to an expanded charter of human rights (current one falls well short).
For those of you thinking that privacy is overblown or irrelevant, please post your name, address, date of birth and social security number. I need a new couch and see no reason not to stick you with the bill. I can also target scams to your particular health issues or risks.
@Radguy - Nice to see that some people have the ability to think things through. Thanks for pointing out the obvious to the foolish. I doubt that it will do any good but at the very least you tried.
As soon as conditions are bad enough you will see those Obamacare chips become mandatory; then any scanner tuned to the right frequency can pick up any and all your info. Enjoy your privacy till then. What man can invent, man can circumvent.
It's funny when people predicting the end of privacy -- or actually calling for it -- post their comments under a pseudonym. See, everyone needs their privacy at some level, even if it's just closing the toilet door.
Less trivially, real discrimination exists and probably always will, around gender, sexual orientation, ethnicity and religion. Knowing this, many people have a particular interest in keeping some of their medical history confidential. Who really needs to know that you had an STD, an abortion, a misdemeanor, or a depressive episode in your teens or twenties?
Frequently, people who see no need for privacy have done well in picking their parents. White middle class middle aged men may not have had the same life experiences as other sections of the community and tend to have a different perspective on privacy.
OK, a sensible comment; but STDs, abortion, misdemeanors and depressive episodes do not inform on those areas you have quoted as involving discrimination. Sure nobody "needs" to know these things (which do also happen to white middle-class people), but nobody is suggesting making such data freely available - we're just questioning whether hindering and slowing down medical research on important medical issues is worth it to efficiently hide such personal things?
Thanks Allan. Yes, there are interesting compromises inherent in ethical medical research.
In my grab-bag of points (intended to collectively show how complex this all is), I was responding to a few people's remarks, such as kienhua68: "Has anyone asked just what is so sacred about medical conditions?" and allophor: "When everybody knows everything about everybody else, privacy will not only have disappeared but will have become irrelevant."
That's frank nonsense, a naive fatalistic confusion of what is and what ought to be. International data protection (or "information privacy") laws anticipate the leaking and collecting of Personally Identifiable Information (PII) via many channels, and provide for sanctions against unauthorised use of PII *no matter where it comes from*. allophor says of Big Data "you will be able to know everything about anybody, using only the indiscretions of everybody else around them". What s/he may not appreciate is that under OECD privacy law in 80+ countries, PII that falls into your lap via "indiscretions" or indeed any third party channel is still protected. So it is just not the case that Big Data destroys privacy. Confidentiality and anonymity are not the same thing as privacy. Even if a second party manages to find out something about me, they are not free to do anything they like with that PII. See also http://lockstep.com.au/blog/2012/10/29/not-too-late-for-privacy.
Returning to medical research, I agree there's a balancing act. Human rights considerations are central to all ethical review processes.
People get emotional about privacy on both sides of the medical research debate, but here's a pragmatic consideration from the middle ground: patients are known to self censor their medical histories, especially when they don't trust the healthcare people and systems they're exposed to. Patients are naturally shy about their HIV status, their recreational drug use, their mental illness past or present. If we want reasonably complete data about people in medical research, then it is in everyone's interests to provide the very best privacy protection, so that participants have the requisite trust and confidence in what's going to happen to their data. In this light it is just abhorrent to hear people preach that privacy is over, and to cast aspersions on those who would choose to keep medical secrets.
Your last point surprises me - I wasn't aware that people have any ability to edit their medical history, certainly here in the UK. I agree incomplete data is little use to any researcher. Perhaps "reasonable" or "good enough" privacy protection could be achieved more quickly and at less cost than your "very best privacy protection".
Sorry, when I said 'self censor their medical histories', I was referring in the main to people holding details back when a medico takes an oral history. Patients simply don't tell their doctors everything, but they reveal more to doctors they trust. Which is why many countries have women's health clinics, and in Australia we have dedicated Aboriginal and Torres Strait Islander health services.
Actual editing of databases is not unknown. In Australia our proposed longitudinal EHR system now in development will include the ability for patients to elect which parts of their record are visible to certain medicos. This is a controversial feature to be sure. On balance the Dept of Health here was persuaded by patients' rights groups over the protestations of doctors' groups.
[Sidebar: This reminds me of a special case which shows how difficult it is to make generalisations about health privacy. When a patient is admitted to a hospital where their ex-spouse works as a medico, there are usually special protocols (often informal) to keep the two parties apart. In small towns, this is a real and present problem. I have done privacy consulting in these sorts of places. It is very difficult to codify the rules, and therefore tricky to develop reliable EHR access control algorithms.
One case I saw first hand involved a female nurse who had previously had a clandestine affair with a male doctor in a country hospital. Not many people knew about it. She happened to be admitted sometime after the affair ended. A Nursing Unit Manager (NUM) who knew the patient, knew she felt herself to be at risk as an in-patient, so the NUM took it upon herself to hide the charts on at least one occasion to keep the doctor from finding out. No matter what position observers may take on the ethical minefield of that case, it's clear I think that programming EHR Access Control rules to cope with the nuances is probably an intractable problem.]
Allan, you say that "incomplete data is little use to any researcher" but I wonder if that's a bit extreme? It's a reality that patients' self reporting is always a little suspect; all medical research protocols need to be designed in light of the way patients filter what they say. Incomplete data is of enormous benefit if the studies are well designed to cope with the gaps.
I'm pretty sure that volunteer rates in medical research would plummet if we promised only to apply "good enough privacy protection".
Ok, whilst the validity of oral histories and patients' self-reporting is of concern to medical practitioners, the issue to researchers is the record of medical history in the database which should not be subject to such vagaries. It needs to be reasonably complete and accurate otherwise meaningful statistical processing becomes impossible. As to "good enough privacy protection", in the UK we have a Biobank of half a million volunteers for research purposes. I confess when I joined I simply assumed that privacy measures would be good enough - it is anonymised - and privacy does not feature at all prominently in the web discussions.
Allan wrote: "I simply assumed that [Biobank] privacy measures would be good enough - it is anonymised - and privacy does not feature at all prominently in the web discussions".
The work of Latanya Sweeney that kicks off this Sci Am article shows that EHR designers' claims of anonymisation need to be reviewed, and why privacy really should be discussed some more.