Earlier this month, former NSA employee Edward Snowden revealed the agency is collecting data on millions of Americans, from phone call durations to Facebook posts, all through a program codenamed PRISM. The resulting media backlash has revived the debates about internet privacy and government surveillance techniques, but questions remain: how is the National Security Agency taking in the data, and how much of a threat to our civil liberties do such data-collection efforts pose?
To find out, Scientific American spoke with metadata expert Mark Herschberg, CTO at Madison Logic and instructor at the Massachusetts Institute of Technology. Herschberg has worked with the programs used to gather big data and was able to elucidate how our internet data has become an important—if not invasive—commodity.
An edited transcript of the interview follows.
What kind of software can be used to collect big data?
It can be collected a number of different ways. It sounds like the NSA is going into the servers of Facebook and others and accessing their log files through some kind of "backdoor." In that case, you can write pretty simple programs that can copy those files and transfer them to local servers. You could also get this data by installing spying software on an individual’s computer. The third option is actually listening through the pipe, the digital version of wiretapping.
When an individual downloads something or visits a webpage, all the data go through the Internet service provider. By effectively wiretapping their line, you can see every single byte they're sending back and forth.
Why would the NSA want access to things like Facebook posts?
What you can get are signals. For example, if you look at teens committing suicide, they often spend a lot of time thinking about it and planning. There are signs that professionals have been trained to look for, such as saying goodbye to people before they take their own life.
Similarly, you might see signals among the bad guys before they're going to do something. They might change their habits. These are things that professionals in counterterrorism can interpret.
You can also use personal data to find out where people have been and what they're doing. Photos taken with cellphones contain geotagging information: I can look at a photo that someone has posted and know exactly where that photo was taken.
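As a rough illustration of the geotagging point: photo metadata typically records latitude and longitude as degrees, minutes and seconds plus a hemisphere letter, and turning that into a map coordinate is simple arithmetic. The sketch below assumes the tag values have already been read out of the file as plain numbers. The function name `dms_to_decimal` and the sample coordinates are made up for illustration and do not come from the interview.

```python
from fractions import Fraction


def dms_to_decimal(dms, ref):
    """Convert (degrees, minutes, seconds) plus a hemisphere letter
    ("N", "S", "E" or "W") into signed decimal degrees."""
    degrees, minutes, seconds = (Fraction(x) for x in dms)
    value = degrees + minutes / 60 + seconds / 3600
    # Southern and western hemispheres are negative by convention.
    if ref in ("S", "W"):
        value = -value
    return float(value)


# Hypothetical tag values, invented for this example.
lat = dms_to_decimal((40, 26, 46), "N")
lon = dms_to_decimal((79, 58, 56), "W")
print(round(lat, 4), round(lon, 4))
```

The resulting pair of decimal numbers can be pasted into any mapping tool, which is why a single posted photo can reveal where it was taken.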
We can also see who is talking to whom and observe important changes. I'll give you an analogy: suppose there's a house, and there are people going in and out. I can't actually see the people going in and out, but I can see the number of cars that pull up to the house. Say I normally see two to three cars a day and I suddenly see 20 cars pull up today. That tells me something is going on. Even if you can't get the detailed information, just looking at changes in communication patterns can alert you that something is happening.
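The car-counting analogy can be sketched as a simple baseline comparison: record a daily count, then flag a day that falls far outside the usual range. This is only one possible way to do it, not a description of any agency's method. The function name `is_unusual`, the three-standard-deviation threshold and the sample counts are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def is_unusual(history, today, threshold=3.0):
    """Return True if today's count is more than `threshold` standard
    deviations away from the historical average.

    `history` needs at least two observations for stdev() to work.
    """
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # Every past day was identical, so any change counts as unusual.
        return today != baseline
    return abs(today - baseline) / spread > threshold


# Hypothetical observations: two to three cars on a normal day.
cars_per_day = [2, 3, 2, 3, 3, 2, 2]
print(is_unusual(cars_per_day, 20))  # a 20-car day stands out
print(is_unusual(cars_per_day, 3))   # a 3-car day does not
```

Note that the check never looks at who is in the cars. It works entirely on counts, which is the point of the analogy.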
There’s a lot of information to collect and store. How can the NSA possibly catalogue all of this data?
I suspect the NSA doesn't have large files on each and every one of us. I'm sure on particularly well-known targets they have information, but I presume you and I aren’t on watch lists.
Brewster Kahle produced a model that said if someone took all the domestic phone calls made in one year and put them into cloud storage, it would cost roughly $27 million a year to store. That's chump change for agencies like the NSA, the Department of Defense and others. Storage is getting so cheap these days that we can store mountains of data at relatively low cost.
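A back-of-the-envelope estimate of this kind is just multiplication: total call minutes, times the data recorded per minute, times the price of storage. The sketch below shows the shape of such a calculation. It is not Kahle's model, and every input value is a placeholder chosen to make the arithmetic easy to follow, so the result should not be compared with the $27 million figure.

```python
def yearly_storage_cost(calls_per_year, avg_minutes_per_call,
                        mb_per_minute, dollars_per_tb_per_year):
    """Estimate the yearly cost of storing recorded calls.

    Uses decimal units: 1 TB = 1,000,000 MB.
    """
    total_mb = calls_per_year * avg_minutes_per_call * mb_per_minute
    total_tb = total_mb / 1_000_000
    return total_tb * dollars_per_tb_per_year


# Placeholder inputs, NOT real statistics and NOT Kahle's assumptions.
cost = yearly_storage_cost(
    calls_per_year=1_000_000_000,
    avg_minutes_per_call=2,
    mb_per_minute=0.5,
    dollars_per_tb_per_year=100,
)
print(f"${cost:,.0f} per year")
```

Because the cost scales linearly with each input, halving the price of storage halves the bill, which is why falling storage prices matter so much here.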