As security concerns mount, networks proliferate and ever more data move online, personal privacy and anonymity are often the first casualties. For the Insights story, "A Little Privacy, Please," appearing in the August 2007 issue of Scientific American, Chip Walter sat down with Carnegie Mellon computer scientist Latanya Sweeney, who discusses the new threats to privacy and ways to fight identity theft and other misuse of information.
Why is privacy versus security becoming such a problem? Why should we even care?
(Laughs) Well, one issue is we need privacy. I don't mean political issues. We literally can't live in a society without it. Even in nature animals have to have some kind of secrecy to operate. For example, imagine a lion that sees a deer down at a lake and it can't let the deer know he's there or [the deer] might get a head start on him. And he doesn't want to announce to the other lions [what he has found] because that creates competition. There's a primal need for secrecy so we can achieve our goals.
Privacy also allows an individual the opportunity to grow and make mistakes and really develop in a way you can't do in the absence of privacy, where there's no forgiving and everyone knows what everyone else is doing. There was a time when you could mess up on the east coast and go to the west coast and start over again. That kind of philosophy was revealed in a lot of things we did. In bankruptcy, for example. The idea was, you screwed up, but you got to start over again. With today's technology, though, you basically get a record from birth to grave and there's no forgiveness. And so as a result we need technology that will preserve our privacy.
How did you get into this line of work? What drew you to mathematics and computer science?
When I was a kid [in about third or fourth grade], one of my earliest memories was wanting to build a black box that could learn as fast as I could. We could learn together, or it could teach me. I wanted the sort of teaching-learning experience that could go as fast and as deep as I could go.
What triggered this black box fantasy?
In hindsight, I think I was bored in school, because I would finish the assignments and would have to wait for the rest of the class. I think it was an outlet and I began spending hours fantasizing about this box. It [eventually] became a real passion -- so when I went on to high school and took my first computer course, that childhood vision and this sort of natural interest with computer programming just melded together.
After high school, you went to M.I.T. How did it feel being one of the few girls in a predominantly male college?
I first went to M.I.T. in 1977. But it was a tough adjustment. I came from a top-notch prep school for girls and going from that environment to M.I.T.--well, it's almost impossible to be more opposite, in every possible way. It was huge, it was in the city, I couldn't sleep it was [so] loud. Oh man.
But the thing that really made it hard for me was the faculty; I had a lot of incidents with the faculty that were really obnoxious.
What kind of obnoxious incidents are you talking about?
The way M.I.T. is structured is that in your freshman year the lectures are in huge halls with over 100 students, and then you go into smaller groups on the same subject that had only 10 to 12 students. Every week there were 10 problems in a problem set. So the guys [in our group] came to me and said, "Look, we're going to start a study group," and I said, "What's a study group?" And they said, "Well, every week we're going to get 10 problems, and every week one of us will get assigned a problem and the day before the homework is due, we'll all meet and your job is to tell the rest of the group your solution and then they don't copy it down, you explain it to them and they go write it up themselves, or if we don't think that's the right solution, we'll discuss it." And I said, "Oh okay. That sounds good to me."
Basically you had 10 people turning in the same assignment. So everyone gets the assignments back and they got 10 out of 10, 10 out of 10, 10 out of 10, seven out of 10. So I go to the instructor and I ask him, "Why would you give me seven out of 10?" And he says, "Well you didn't show enough of the work to show the process." So [I try again and] scores are 10 out of 10, 10 out of 10, 10 out of 10, seven out of 10. I ask the instructor again, "Why did I get seven out of 10?" And he says, "Well you showed so much detail that it seems you didn't really understand the concepts." So then I go through this thing where I'm trying to get the right amount of detail.
Did you ever figure out the real reason behind those seven out of 10s?
One day we had these resistance cubes -- this is an engineering class -- and they have colors on them and the colors tell how much resistance is inside the canister. We had to memorize these color codes. So in one class the instructor says, "The way I remembered the resistor [color codes] when I was a lad at M.I.T. was the following: 'Black boys rape only young girls but Violet gives willingly.'" When he said that I think I understood the [reason behind the] seven out of 10.
Later I left M.I.T. and ran my own computer company for 10 years. Then I went to Harvard, and then went from Harvard back to M.I.T. as a graduate student.
What was it like to go back to the same department?
When I returned, that teacher was the head of the department. And it was really funny, but you know what, when I got back my attitude was "I'm not taking any crap." And I had absolutely no problems in my graduate career. But in my undergraduate years I was definitely not prepared for what I had to deal with there.
And now you are head of the Data Privacy Lab at Carnegie Mellon University. Why did you create it?
One day I was in grad school and [in my research] I came across this letter that roughly said: at the age of two the patient was sexually molested, at the age of three she stabbed her sister with scissors, at the age of four, her parents got divorced, at the age of five she set fire to her home. And then I realized there was nothing in that description that [would be changed by] scrubbing out identifiable information. I'll bet you there's only one person with that experience. And that made me realize that identifiability is really fascinating, and it made me realize that I didn't understand a thing about privacy. Removing the explicit identifiers wasn't what it was about. I realized there's a lot more to this than a notion of what makes me identifiable.
And it was then that I started realizing that privacy in the data space is a little bit different. It requires tracking people where they go. And when all this technology began exploding, you begin to realize that it's huge.
So what makes your lab different from others that look into these issues?
I started the lab to do what I call "research by fire." We don't operate like a think tank or tackle abstract problems. If you have a real world crisis, you can come to our lab, give us a little money, and we will solve your problem. But because these are real world problems, it really is research by fire. We don't have the luxury of sitting back and speculating and thinking. The judge needs a decision and an answer now, otherwise so-and-so is going to sue. So companies and government agencies give us grants as partners in the lab and they give us problems that need solutions within a given time period and the goal is to solve those problems.
What kinds of problems do you tackle?
All kinds, from DNA privacy, video piracy to problems with losing revenue streams, being sued or filing suit. A lot of the technologies [we have developed] came from that sort of work.
We roll up our sleeves and ask "How do I learn really sensitive information? How do I exploit the heck out of the data that seems so innocent out there?" And if we're really good at doing that then we can create strategies for controlling privacy abuse.
When a problem comes to us, whether it's bioterrorism or something else, we find ourselves doing a deep dive into that policy setting or regulatory environment, usability issues and even the business issues. We have to take on all of these constraints and come up with a solution, and often that's a new technology, sometimes it's just a patch, very rarely is it just a recommendation. And that's what we do.
Your Identity Angel software is able to gather up disconnected pieces of information about people from data available all over the Internet. How does it work?
It is very easy to do scans for individuals from information that is publicly available or freely given away or sold for a cost. That means you don't have to break into a system to get data you're not supposed to have; it means you can gather the information from what is already out there.
[Earlier in my career] I had learned that if I had the date of birth, gender and a five-digit zip code of a person, I could identify 87 percent of the people in the United States. So even if you don't give me your social security number, I can find out who you are nearly nine out of 10 times.
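The 87 percent figure rests on counting how many people are the only ones with their particular combination of birth date, gender and ZIP code. A minimal sketch of that counting step follows, assuming a handful of made-up records; the `records` list and the `unique_fraction` helper are illustrative names, not part of Sweeney's study or software.

```python
from collections import Counter

# Toy records: (date_of_birth, gender, five_digit_zip). All values are made up.
records = [
    ("1985-03-14", "F", "15213"),
    ("1985-03-14", "F", "15213"),  # shares all three fields with the record above
    ("1985-03-14", "M", "15213"),
    ("1990-07-02", "F", "15217"),
    ("1972-11-30", "M", "02139"),
]


def unique_fraction(rows):
    """Return the share of rows whose field combination appears exactly once."""
    counts = Counter(rows)
    unique = sum(1 for row in rows if counts[row] == 1)
    return unique / len(rows)


fraction = unique_fraction(records)
print(f"{fraction:.0%} of the toy records are unique on (DOB, gender, ZIP)")
```

In this toy set, three of the five records are unique on the three fields. Run over real population data, the same count is what would yield a figure like the one quoted above.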
That led to Identity Angel?
One of the things we suspected at the lab was that people in their early 20s with credit cards were especially vulnerable to identity theft. Our lab began looking at this, and we found that that is a time in people's lives when they're not very stable. Their addresses are changing continuously, and so if you were to [steal an identity and] get a credit card in their name, the fact that an address had changed was not something that would trigger a red flag.
What else makes 20-somethings especially vulnerable to identity theft?
The other thing is that they don't have a lot of prior credit records, and credit card companies are really anxious to give them credit cards. At the same time there is a lot of information about them on the internet since they're in that age group where they are used to creating web pages on Facebook and MySpace. A lot of the information also came from students routinely releasing their information by putting it in their résumés. Why would anybody put a social security number on his or her résumé? But they did.
All of this simplified creating a fraudulent student credit card -- a name and address, a social security number, and date of birth. The challenge of Identity Angel was to find and combine this information from the internet. It mines information including résumés off the Internet and looks for ones that have the information, social security number, date of birth, etc. -- enough information to get a credit card in the person's name.
What does Identity Angel do with that mined information?
If it succeeds, the software then tries to find an email address and send [the victim] an email letting them know we found this information.
You also developed a program called k-anonymity. What would be an example of its use?
We have a project with the Department of Housing and Urban Development. They want to know where people have been without knowing who they are. And in this case, they never want to know who they are. So I built this system that allows them to do this. It is actually tracking the homeless. Congress appropriated a large amount of money in 2004 to create the Homeless Management Information System. And the idea of the system is to track the service utilizations of the homeless, and that's because there are a whole lot of questions about homelessness and they want a system that gathers that information.
Congress says this is about money. The cost of homelessness is exploding. Is this because there are too many homeless, because they are eating too much food, or because there is fraud in the system? What's actually happening?
Why do homeless people need privacy protection?
There is one special class of homelessness for which privacy became critical, and that's domestic violence victims, and it turns out they account for a huge percentage of the dollars spent in the system. They are afraid that the person stalking them [could find them]. So they wanted to be able to track people, but do it in a way that even if you knew all of the intimate details about that person, even if you got access to the data, you still couldn't identify that person.
This required a deep dive into cryptography. The earlier "scrub system" I had developed [such as Identity Angel] was all about text. That's just text mining. But this led us into different areas -- video, face identification, etc., which is a deep dive into computer graphics and computer vision.
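The k-anonymity idea mentioned above can be stated compactly: a released table is k-anonymous if every combination of identifying attributes it contains is shared by at least k records, so no one can be narrowed down to fewer than k candidates. The sketch below shows only that check, under assumed toy data. The generalization rule (birth year only, first three ZIP digits) and the names `generalize` and `is_k_anonymous` are hypothetical, and this is not the system built for the Homeless Management Information System.

```python
from collections import Counter


def generalize(record):
    """Coarsen quasi-identifiers: keep birth year only and the first 3 ZIP digits."""
    dob, gender, zip_code = record
    return (dob[:4], gender, zip_code[:3] + "**")


def is_k_anonymous(rows, k):
    """True if every combination of quasi-identifier values occurs at least k times."""
    counts = Counter(rows)
    return all(count >= k for count in counts.values())


# Made-up records: (date_of_birth, gender, five_digit_zip).
raw = [
    ("1985-03-14", "F", "15213"),
    ("1985-09-01", "F", "15217"),
    ("1990-07-02", "M", "15213"),
    ("1990-01-20", "M", "15232"),
]
released = [generalize(r) for r in raw]

print(is_k_anonymous(raw, 2))       # every raw record is unique
print(is_k_anonymous(released, 2))  # each generalized combination appears twice
```

The trade-off is visible even at this size: the coarser the released values, the larger the groups, and the less detail remains for analysis.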
So how do we solve the privacy problem? What are the best and worst-case scenarios?
My answer is that the privacy problems that I've seen are probably best solved by the person who first created the technology. What we really have to do is train engineers and computer scientists to design and build technologies in the right kind of way from the beginning.
Normally, engineers and computer scientists get ideas for technologies on their own and engage in a kind of circular thinking and develop a prototype of their solution and then do some kind of testing. But we are saying we will give them tools that help them see who are the stakeholders and do a risk assessment, and then see what barriers will come up and deal with the riskiest problems and work to solve them in the technology design.
I think if we are successful in producing a new breed of engineers and computer scientists, society will really benefit. The whole technology-dialectics thing is really aiming at how you should go about teaching engineers and computer scientists to think about user acceptance and social adoption [and also that they] have to think about barriers to technology [from the beginning].
So the best scenario is that this kind of training takes hold and as new technologies emerge they are less likely to be constantly clashing with accept-or-reject options.
Is it hard to break down those cultural barriers and change the way people work?
There should be privacy technology departments, because there are no technologies for handling privacy problems [proactively]. The best solutions lie in the technology design. So we are targeting the creation of tools for the engineers and computer scientists, to give them software tools that help them work in a way they are already used to working and give them a way to gather all of the right information and then infuse it in their designs.
And a lot of the time, the financial model isn't there to do it. Sometimes society gets so annoyed at what happens, and it ends up on the front page of the New York Times. The reaction isn't always rational. Policy doesn't have the nuances of the technology.
If we build the right designs in up front, then society can decide how to turn those controls on and off. But if the technology is built without controls, it forces us to either accept the benefits of the technology without controls, or cripple it by adding them later.
Several years ago, Scott McNealy, the CEO of Sun Microsystems, famously quipped, "Privacy is dead. Get over it."
Oh privacy is definitely not dead. When people say you have to choose, it means they haven't actually thought the problem through or they aren't willing to accept the answer.
Remember, it's in [McNealy's] interest to say that, because he very much shares that attitude of the computer scientist who built the technology that's invasive; who says, "Well, you want the benefits of my technology, you'll get over privacy". It's exactly the kind of computer scientist we don't want to be graduating in the future. We want the computer scientist who will resolve these kinds of barriers in conflict, identify them and resolve them in their technology design.
So where do you see the big problems?
It really is pretty much everywhere. Identity management is a critical problem that we just keep ignoring. Social security numbers are a whole discussion unto themselves -- how they've outplayed themselves, do they need to be replaced? Now in law enforcement and the department of justice, they're saying it should be fingerprints. So now we'll see little devices in computers and cars and even refrigerators with very expensive fingerprint readers on them. But that's a problem because fingerprints could become the next social security number. They could give us all the ills of the social security number and worse. I can't get rid of my fingerprint, it goes with me wherever I go. I don't wear my social security number on my head.
How would it be stolen, though? What specifically would you see as the problem with fingerprints?
Well, we leave them everywhere, which is really good for law enforcement because they know where to find us at all times, but that means that anyone could pick them up. The point is you can see the progression. Fingerprint databases will proliferate all over and that will create problems. Someone could access the database and replicate your fingerprint and make a card, which wouldn't really be their card, it'd be yours.
So this leaves more and more bits of data about all of us out there on the Internet, including email?
Yes, and you can tell a lot about a person that way, you can even impersonate them. That's another thing I expect to see over the next five years. Thieves will do a little research on you, impersonate you and maybe send an email to someone you know to elicit funds because now they have more information about you.
Medical privacy is a sensitive area, too.
The big vulnerabilities there come from the insurance companies and employers, people who ultimately pay the medical bills. Those parties have an interest in knowing what you have been diagnosed with and making decisions that impact your employment or income. There was an article written about a banker in Maryland who used to cross a cancer registry with people who had loans and mortgages with his bank, and would then call in those loans. Now the story was retracted, because it was being debated whether it was true or not. But the person who was responsible for the story showed me lots of documentation that showed it was true. But the point I am making is that true or false, it is certainly easy to do. And you can see the financial incentives. So the problem with Scott McNealy's approach, the trusted agent approach, is that to the extent that they are the only party to see the data, maybe society can trust them. But the truth is -- you aren't the only party [who can get your hands on the information], and what you are advocating is not just for you but a lot of parties that you simply can't be accountable for.
DNA data is becoming more widely available now. If you only have the DNA of a person and nothing else, could you find out who that person is?
In one project, we chose to look at patients with Huntington's disease, because it's easy to spot in the DNA. One part of the DNA will repeat, and that's normal, but if you have Huntington's it repeats a huge amount of times. Also the longer the repeat, the earlier the age of onset of the disease. So we could make a prediction about the person's age at the time they were diagnosed with it. These were all Huntington's patients in the state of Illinois. We then used hospital discharge information that was publicly available and looked for diagnoses of [discharged] Huntington's patients and began to match them to the DNA to identify those people. We successfully matched 20 out of 22 people. That was shocking.
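The matching step described above is a form of record linkage: an attribute inferred from one data set (here, an estimated age) is compared against a second data set, and a match is kept only when exactly one record fits. A minimal sketch follows, assuming the estimated ages are already given, since the interview does not state how repeat length maps to age. All identifiers, ages and the two-year `tolerance` are made up for illustration.

```python
def link_unique(dna_profiles, discharges, tolerance=2):
    """Match each DNA profile to a discharge record only when exactly one fits."""
    matches = {}
    for profile_id, est_age in dna_profiles.items():
        candidates = [
            record_id
            for record_id, age in discharges.items()
            if abs(age - est_age) <= tolerance
        ]
        if len(candidates) == 1:  # ambiguous or empty candidate sets are skipped
            matches[profile_id] = candidates[0]
    return matches


# Hypothetical inputs: estimated age per DNA sample, recorded age per discharge.
dna_profiles = {"sample-1": 34, "sample-2": 51, "sample-3": 42}
discharges = {"record A": 35, "record B": 50, "record C": 41, "record D": 43}

print(link_unique(dna_profiles, discharges))
```

Here `sample-3` stays unmatched because two discharge records fall within the tolerance. A real linkage would use further shared attributes to break such ties.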
Are we postponing the privacy problem, or are we confronting it?
A lot of the surveillance can be done with privacy protections. But under the current administration, those in Homeland Security call it the 'P word'. Their statement is that as long as you don't say the P word you don't have a P problem, whether you do or you don't. So the FBI gets slapped on the wrist for gathering all of this additional data, but a lot of that could have been anonymized. But right now, there is no funding or interest in using these technologies at all.