# Privacy by the Numbers: A New Approach to Safeguarding Data

A mathematical technique called “differential privacy” gives researchers access to vast repositories of personal data while meeting a high standard for privacy protection

Suppose the true answer to a counting question put to the database is 157. The differentially private algorithm will “add noise” to the true answer; that is, before returning an answer, it will add or subtract from 157 some number, chosen randomly according to a predetermined set of probabilities. Thus, it might return 157, but it also might return 153, 159 or even 292. The person who asked the question knows which probability distribution the algorithm is using, so she has a rough idea of how much the true answer has likely been distorted (otherwise the answer the algorithm spat out would be completely useless to her). However, she doesn’t know which random number the algorithm actually added.
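To make this concrete, here is a minimal Python sketch of such a noisy-counting mechanism. The function name, the privacy parameter of 0.1 and the use of NumPy are illustrative choices, not part of the article; the Laplace distribution the noise is drawn from is introduced below.

```python
import numpy as np

def noisy_count(true_count, epsilon, rng=None):
    """Return a differentially private count: the true answer plus random noise.

    The noise is drawn from a Laplace distribution whose width 1/epsilon
    controls how much the answer is blurred.
    """
    rng = rng or np.random.default_rng()
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + round(noise)

# Suppose the true answer is 157; each query returns a differently blurred value.
for _ in range(3):
    print(noisy_count(157, epsilon=0.1))
```

Each call draws fresh noise, so repeated runs return values such as 153 or 159, and occasionally something far from 157.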

The particular probability distribution being used must be chosen with care. To see what kind of distribution will ensure differential privacy, imagine that a prying questioner is trying to find out whether I am in a database. He asks, “How many people named Erica Klarreich are in the database?” Let’s say he gets an answer of 100. Because Erica Klarreich is such a rare name, the questioner knows that the true answer is almost certainly either 0 or 1, leaving two possibilities:

(a)   The answer is 0 and the algorithm added 100 in noise; or

(b)   The answer is 1 and the algorithm added 99 in noise.

To preserve my privacy, the probability of picking 99 or 100 must be almost exactly the same; then the questioner will be unable to distinguish meaningfully between the two possibilities. More precisely, the ratio of these two probabilities should be at most the preselected privacy parameter R. And this should be the case with regard to not only 99 and 100 but also any pair of consecutive numbers; that way, no matter what noise value is added, the questioner won’t be able to figure out the true answer.
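As a quick numerical check of this condition, one can compare how likely the two noise values are under the distribution the article introduces next, the Laplace distribution. This is a sketch; the width of 10 is an arbitrary illustrative choice.

```python
import math

def laplace_density(x, width):
    # Density at x of a Laplace distribution centred at 0 with the given width.
    return math.exp(-abs(x) / width) / (2 * width)

width = 10.0
p_a = laplace_density(100, width)  # possibility (a): true answer 0, noise +100
p_b = laplace_density(99, width)   # possibility (b): true answer 1, noise +99
print(p_b / p_a)  # ≈ 1.105: the two possibilities are nearly equally likely
```

The ratio works out to e^(1/width), or its reciprocal, no matter which pair of consecutive noise values you compare, which is exactly the property described next.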

A probability distribution that achieves this goal is the Laplace distribution, which comes to a sharp peak at 0 and gradually tapers off on each side. A Laplace distribution has exactly the property we need: there is some number R, determined by the distribution’s width, such that for any two consecutive numbers, the ratio of their probabilities is R.

There is one Laplace distribution for each possible width; thus, we can tinker with the width to get the Laplace distribution that gives us the exact degree of privacy we want. If we need a high level of privacy, the corresponding distribution will be comparatively wide and flat; numbers distant from 0 will be almost as probable as numbers close to 0, ensuring that the data are blurred by enough noise to protect privacy.
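Numerically, the trade-off between width and privacy looks like this (a sketch; the widths are arbitrary examples):

```python
import math

for width in (1, 10, 100):       # wider distribution = flatter = more private
    R = math.exp(1 / width)      # probability ratio for consecutive numbers
    print(f"width {width:>3}: consecutive-number ratio R = {R:.3f}")

# width   1: consecutive-number ratio R = 2.718
# width  10: consecutive-number ratio R = 1.105
# width 100: consecutive-number ratio R = 1.010
```

A ratio close to 1 is what makes neighboring true answers indistinguishable, at the cost of noisier query results.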

Inevitably, tension exists between privacy and utility. The more privacy you want, the more Laplace noise you have to add and the less useful the answer is to researchers studying the database. With a Laplace distribution, the expected amount of added noise is the reciprocal of the privacy parameter ε, which is related to the ratio above by R = e^ε (a small ε means a ratio close to 1 and strong privacy); so, for example, if you have chosen a privacy parameter of 0.01, you can expect the algorithm’s answer to be blurred by about 100 in noise.
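This relationship is easy to confirm by simulation (a sketch; the seed and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
epsilon = 0.01
noise = rng.laplace(scale=1.0 / epsilon, size=100_000)
print(np.abs(noise).mean())  # ≈ 100, the reciprocal of epsilon
```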

The larger the dataset, the less a given amount of blurring will affect utility: Adding 100 in noise will blur an answer in the hundreds much more than an answer in the millions. For datasets on the scale of the Internet — that is, hundreds of millions of entries — the algorithm already provides good enough accuracy for many practical settings, Dwork said.
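A back-of-the-envelope comparison makes the point; the two example counts below are mine, not the article's.

```python
typical_noise = 100  # expected blur for a privacy parameter of 0.01, as above
for true_answer in (300, 250_000_000):
    relative_error = typical_noise / true_answer
    print(f"true answer {true_answer:>11,}: typical relative error {relative_error:.2e}")

# true answer         300: typical relative error 3.33e-01  (33%: badly blurred)
# true answer 250,000,000: typical relative error 4.00e-07  (negligible)
```

At Internet scale, the same 100 units of noise amounts to a rounding error.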

And the Laplace noise algorithm is only the first word when it comes to differential privacy. Researchers have come up with a slew of more sophisticated differentially private algorithms, many of which have a better utility-privacy trade-off than the Laplace noise algorithm in certain situations.

“People keep finding better ways of doing things, and there is still plenty more room for improvement,” Dwork said. When it comes to more moderate-sized datasets than the Internet, she said, “there are starting to be algorithms out there for many tasks.”

## Comments
1. AllanRBrewer 01:50 PM 12/31/12

We all have a great deal to gain by allowing our medical data to be shared - perhaps we should also put a price on openness.

2. DaniEder 02:47 PM 12/31/12

A real-life example of adding random errors to data is from the early GPS network days. The satellites mis-reported their position slightly, so that users on the ground ended up with slightly incorrect locations. Military users, however, got a code to decrypt the position error sent by the satellites, and were able to determine their location more exactly. This allowed the general public to use GPS, but in wartime prevented the enemy from having as good an accuracy as your own military.

The random errors were changed frequently to prevent opponents from decrypting the corrections in enough time to be of use.

3. kienhua68 03:45 PM 1/1/13

This is all well and good. Most odd is the amount of attention paid to being private. Has anyone asked just what is so sacred about medical conditions? It is as though some overwhelming fear of exposure has taken over. Privacy, overdone, can lead to suspicion, often a heavier burden than exposure. Do any of us want to be exposed to the vagaries of chance as a consequence of 'privacy'? Do any recent events ring a bell? Privacy must have limits with regard to how it affects the rest of us.

4. AllanRBrewer in reply to kienhua68 04:02 PM 1/1/13

@kienhua68: I do agree with you. Apart from the fear of a higher insurance premium if propensity to disease is revealed (which anyway can be legislated out by requirements on the insurance companies), what exactly are people trying to keep private? This is a genuine question.

5. allaphor 06:51 AM 1/2/13

The whole concept of privacy will soon become obsolete. As more and more data sources come online, most of them with no regard for privacy concerns, the private will cease to be... People are already on the verge of live-streaming their whole lives, and not only their own lives but also the fractions of life they share with every other person. With the current development of big data tools, you will be able to know everything about anybody, using only the indiscretions of everybody else around them. The problem with privacy is asymmetry: it's me knowing something about you while you know nothing about me. When everybody knows everything about everybody else, privacy will not only have disappeared but will have become irrelevant.

6. Radguy 11:07 AM 1/3/13

I recently completed a Uni group project addressing these systems.

Medical records can include unflattering or worse images of yourself.

This data along with other data can be extremely revealing, potentially including biometrics.

I want the anonymous data to be fully accessible to doctors and researchers. I also see great value in other data relating to consumption (not the TB kind) being made available to all through independent government agencies, as businesses always run better with the lights on.

A trustable system is capable of bringing us not just "real democracy", but "real time democracy", impervious to undemocratic influences.

The more identifying information that can be kept from online networks, the better. Only now are businesses waking up to the lack of security within their online databases, as some small businesses have their data encrypted by hackers who then extort them for access. I have never trusted an online computer with my data. Keeping important stuff offline is a good insurance policy.

People have private phone numbers, emails, social networks and so on. These practices can save a lot of hassle. If some people want to let it all hang out, that's fine, but if you find you have an audience, I think you will appreciate some curtains.

If the existence of Stuxnet isn't enough to concern you regarding the security of online computers, check out the research going on which is opening up the possibility of spying on offline machines through detection of electric fields generated by keyboards, mice, displays and their cables. Of course, wireless is a pushover to those in the know.

Some see the problems we are trying to resolve as trivial. I see an unhackable identity system as being the most important piece of infrastructure in our future society. The other main problem we must confront is ensuring that forced access for law enforcement is always limited by an independent judiciary that adheres to an expanded charter of human rights (current one falls well short).

7. justyntoo 04:22 PM 1/5/13

As soon as conditions are bad enough you will see those Obamacare chips become mandatory; then any scanner tuned to the right frequency can pick up any and all your info. Enjoy your privacy till then. What man can invent, man can circumvent.

8. StephenWilson 03:01 PM 1/6/13

It's funny when people who predict the end of privacy -- or actually call for it -- post their comments under a pseudonym. See, everyone needs their privacy at some level, even if it's just closing the toilet door.
Less trivially, real discrimination exists and probably always will, around gender, sexual orientation, ethnicity and religion. Knowing this, many people have a particular interest in keeping some of their medical history confidential. Who really needs to know that you had an STD, an abortion, a misdemeanor, or a depressive episode in your teens or twenties?
Frequently, people who see no need for privacy have done well in picking their parents. White middle class middle aged men may not have had the same life experiences as other sections of the community and tend to have a different perspective on privacy.

9. AllanRBrewer in reply to StephenWilson 04:20 PM 1/6/13

OK, a sensible comment; but STDs, abortions, misdemeanors and depressive episodes do not bear on the areas you have cited as involving discrimination. Sure, nobody "needs" to know these things (which do also happen to white middle-class people), but nobody is suggesting making such data freely available - we're just questioning whether hindering and slowing down medical research on important issues is worth it to effectively hide such personal things.

10. StephenWilson 05:07 PM 1/6/13

Thanks Allan. Yes, there are interesting compromises inherent in ethical medical research.
In my grab-bag of points (intended to collectively show how complex this all is), I was responding to a few people's remarks, such as kienhua68's "Has anyone asked just what is so sacred about medical conditions?" and allaphor's "When everybody knows everything about everybody else, privacy will not only have disappeared but will have become irrelevant."
That's frank nonsense, a naive fatalistic confusion of what is and what ought to be. International data protection (or "information privacy") laws anticipate the leaking and collecting of Personally Identifiable Information (PII) via many channels, and provide for sanctions against unauthorised use of PII *no matter where it comes from*. allaphor says of Big Data "you will be able to know everything about anybody, using only the indiscretions of everybody else around them". What s/he may not appreciate is that under OECD privacy law in 80+ countries, PII that falls into your lap via "indiscretions" or indeed any third-party channel is still protected. So it is just not the case that Big Data destroys privacy. Confidentiality and anonymity are not the same thing as privacy. Even if a second party manages to find out something about me, they are not free to do anything they like with that PII. See also http://lockstep.com.au/blog/2012/10/29/not-too-late-for-privacy.
Returning to medical research, I agree there's a balancing act. Human rights considerations are central to all ethical review processes.
People get emotional about privacy on both sides of the medical research debate, but here's a pragmatic consideration from the middle ground: patients are known to self-censor their medical histories, especially when they don't trust the healthcare people and systems they're exposed to. Patients are naturally shy about their HIV status, their recreational drug use, their mental illness past or present. If we want reasonably complete data about people in medical research, then it is in everyone's interests to provide the very best privacy protection, so that participants have the requisite trust and confidence in what's going to happen to their data. In this light it is just abhorrent to hear people preach that privacy is over, and to cast aspersions on those who would choose to keep medical secrets.

11. AllanRBrewer 05:42 PM 1/6/13

Your last point surprises me - I wasn't aware that people have any ability to edit their medical history, certainly here in the UK. I agree incomplete data is little use to any researcher. Perhaps "reasonable" or "good enough" privacy protection could be achieved more quickly and at less cost than your "very best privacy protection".

12. StephenWilson in reply to AllanRBrewer 10:51 PM 1/6/13

Sorry, when I said 'self-censor their medical histories', I was referring in the main to people holding details back when a medico takes an oral history. Patients simply don't tell their doctors everything, but they reveal more to doctors they trust. Which is why many countries have women's health clinics, and in Australia we have dedicated Aboriginal and Torres Strait Islander health services.
Actual editing of databases is not unknown. In Australia our longitudinal EHR system, now in development, will include the ability for patients to elect which parts of their record are visible to certain medicos. This is a controversial feature, to be sure. On balance the Dept of Health here was persuaded by patients' rights groups over the protestations of doctors' groups.
[Sidebar: This reminds me of a special case which shows how difficult it is to make generalisations about health privacy. When a patient is admitted to a hospital where their ex-spouse works as a medico, there are usually special protocols (often informal) to keep the two parties apart. In small towns, this is a real and present problem. I have done privacy consulting in these sorts of places. It is very difficult to codify the rules, and therefore tricky to develop reliable EHR access control algorithms.
One case I saw first hand involved a female nurse who had previously had a clandestine affair with a male doctor in a country hospital. Not many people knew about it. She happened to be admitted sometime after the affair ended. A Nursing Unit Manager (NUM) who knew the patient, knew she felt herself to be at risk as an in-patient, so the NUM took it upon herself to hide the charts on at least one occasion to keep the doctor from finding out. No matter what position observers may take on the ethical minefield of that case, it's clear I think that programming EHR Access Control rules to cope with the nuances is probably an intractable problem.]
Allan, you say that "incomplete data is little use to any researcher" but I wonder if that's a bit extreme? It's a reality that patients' self reporting is always a little suspect; all medical research protocols need to be designed in light of the way patients filter what they say. Incomplete data is of enormous benefit if the studies are well designed to cope with the gaps.

13. StephenWilson in reply to AllanRBrewer 11:11 PM 1/6/13

I'm pretty sure that volunteer rates in medical research would plummet if we promised only to apply "good enough privacy protection".

14. AllanRBrewer in reply to StephenWilson 04:47 PM 1/7/13

OK, whilst the validity of oral histories and patients' self-reporting is of concern to medical practitioners, the issue for researchers is the record of medical history in the database, which should not be subject to such vagaries. It needs to be reasonably complete and accurate; otherwise meaningful statistical processing becomes impossible. As to "good enough privacy protection", in the UK we have a Biobank of half a million volunteers for research purposes. I confess when I joined I simply assumed that privacy measures would be good enough - it is anonymised - and privacy does not feature at all prominently in the web discussions.

15. StephenWilson in reply to AllanRBrewer 01:38 AM 1/8/13

Allan wrote: "I simply assumed that [Biobank] privacy measures would be good enough - it is anonymised - and privacy does not feature at all prominently in the web discussions".
The work of Latanya Sweeney that kicks off this Sci Am article shows why EHR designers' claims of anonymisation need to be reviewed, and why privacy really should be discussed some more.
