Earlier this month, former NSA employee Edward Snowden revealed the agency is collecting data on millions of Americans, from phone call durations to Facebook posts, all through a program codenamed PRISM. The resulting media backlash has revived the debates about internet privacy and government surveillance techniques, but questions remain: how is the National Security Agency taking in the data, and how much of a threat to our civil liberties do such data-collection efforts pose?

To find out, Scientific American spoke with metadata expert Mark Herschberg, CTO at Madison Logic and instructor at the Massachusetts Institute of Technology. Herschberg has worked with the programs used to gather big data and was able to elucidate how our internet data has become an important—if not invasive—commodity.

An edited transcript of the interview follows.

What kind of software can be used to collect big data?

It can be collected in a number of different ways. It sounds like the NSA is going into the servers of Facebook and others and accessing their log files through some kind of "backdoor." In that case, you can write pretty simple programs that can copy those files and transfer them to local servers. You could also get this data by installing spying software on an individual’s computer. The third option is actually listening through the pipe, the digital version of wiretapping.

When an individual downloads something or visits a webpage, all the data go through the Internet service provider. By effectively wiretapping their line, you can see every single byte they're sending back and forth. 

Why would the NSA want access to things like Facebook posts?

What you can get are signals. For example, if you look at teens committing suicide, they often spend a lot of time thinking about it and planning. There are signs that professionals have been trained to look for, such as saying goodbye to people before they take their own life.

Similarly, you might see signals among the bad guys before they're going to do something. They might change their habits. These are things that professionals in counterterrorism can interpret.

You can also use personal data to find out where people have been and what they're doing. Photos taken with cellphones contain geotagging information: I can look at a photo that someone has posted and know exactly where that photo was taken.
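
As a rough illustration of the geotagging point: photo metadata (EXIF) typically stores location as degrees, minutes and seconds plus a hemisphere letter, and turning that into map coordinates is simple arithmetic. The sketch below assumes the GPS fields have already been read out of a photo; the dictionary and its values are hypothetical, and `dms_to_decimal` is a helper name invented for this example.

```python
def dms_to_decimal(degrees, minutes, seconds, ref):
    """Convert degrees/minutes/seconds plus a hemisphere letter to signed decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    # South latitudes and west longitudes are negative by convention.
    return -value if ref in ("S", "W") else value


# Hypothetical values, shaped like the GPS fields found in a photo's EXIF metadata.
photo_gps = {
    "GPSLatitude": (40, 30, 0.0),
    "GPSLatitudeRef": "N",
    "GPSLongitude": (74, 15, 0.0),
    "GPSLongitudeRef": "W",
}

latitude = dms_to_decimal(*photo_gps["GPSLatitude"], photo_gps["GPSLatitudeRef"])
longitude = dms_to_decimal(*photo_gps["GPSLongitude"], photo_gps["GPSLongitudeRef"])
print(latitude, longitude)  # 40.5 -74.25
```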

We can also see who is talking to whom and observe important changes. I'll give you an analogy: suppose there's a house, and there are people going in and out. I can't actually see the people going in and out, but I can see the number of cars that pull up to the house. Say I normally see two to three cars a day and I suddenly see 20 cars pull up today. That tells me something is going on. Even if you can't get the detailed information, just looking at changes in communication patterns can alert you that something is happening.
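
The car-counting analogy can be sketched in a few lines: compare today's count against the recent baseline and flag it when it sits far outside the usual spread. This is a minimal sketch of one common approach (a z-score test), not a description of any agency's method; the function name, the threshold of 3 and the sample counts are assumptions made for the example.

```python
from statistics import mean, stdev


def is_unusual(history, today, threshold=3.0):
    """Flag today's count if it is more than `threshold` standard deviations from the baseline."""
    baseline = mean(history)
    spread = stdev(history)  # sample standard deviation; needs at least two observations
    if spread == 0:
        # A perfectly flat history: any change at all is a departure from the pattern.
        return today != baseline
    return abs(today - baseline) / spread > threshold


# Hypothetical week of observations: two to three cars a day.
daily_cars = [2, 3, 2, 3, 2, 3, 2]

print(is_unusual(daily_cars, 20))  # True: 20 cars is far outside the usual range
print(is_unusual(daily_cars, 3))   # False: 3 cars is a normal day
```

Note that the check never looks at who is in the cars; the count alone is enough to raise the flag.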

There’s a lot of information to collect and store. How can the NSA possibly catalogue all of this data?

I suspect the NSA doesn't have large files on each and every one of us. I'm sure on particularly well-known targets they have information, but I presume you and I aren’t on watch lists.

Brewster Kahle produced a model that said if someone took all the domestic phone calls made in one year and put them into cloud storage, it would cost roughly $27 million a year to store. That's chump change for agencies like the NSA, the Department of Defense and others. Storage is getting so cheap these days that we can store mountains of data at relatively low cost.
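
An estimate of this kind comes down to multiplying a few quantities: how many calls, how long they run, how many bytes a minute of audio takes, and what a gigabyte costs to keep for a year. The sketch below shows only the shape of that arithmetic. Every input is a placeholder chosen to make the math easy to follow; none of them are Kahle's figures, and the result is not meant to reproduce his $27 million estimate.

```python
def yearly_storage_cost(calls_per_year, avg_call_minutes, bytes_per_minute, dollars_per_gb_year):
    """Back-of-the-envelope cost of keeping one year of call audio in storage."""
    total_bytes = calls_per_year * avg_call_minutes * bytes_per_minute
    total_gb = total_bytes / 1e9  # decimal gigabytes
    return total_gb * dollars_per_gb_year


# Placeholder inputs for illustration only (not real call volumes or prices).
cost = yearly_storage_cost(
    calls_per_year=1_000_000,
    avg_call_minutes=2,
    bytes_per_minute=500_000,
    dollars_per_gb_year=1.0,
)
print(f"${cost:,.2f} per year")  # $1,000.00 per year
```

Because the cost scales linearly with each input, halving the price per gigabyte halves the bill, which is why falling storage prices matter so much here.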

When you think about the number of phone calls one can make in a year, there's an upper bound to that. Technology doesn't let me speak any faster—we're not making a much higher magnitude of calls than we were a few years ago, but the storage capacity has gone up many times. Our storage is outpacing our ability to produce information. We are sending more emails than we were years ago, but at a certain point, I can only bang out emails so fast. But the ability to store those emails? That continues to increase exponentially.

Do you see a future where entities store all the information we produce?

It's not the future. It's the present, and it’s called Google, it's called Yahoo, it's called Facebook. Facebook already has every IM you've ever sent [through Facebook]. Google has saved all those emails you've been sending [through Gmail]. They have it, they've indexed it, and they’ve generated models on you. This isn't the future; this is the last few years.

Is that data collection for advertising purposes?

Definitely. In advertising, retailers build certain models. If every week I'm buying beer and chips, and then suddenly they see me buying a pregnancy test, and then they see me buying diapers, they can say "Oh, okay. Single life is over. We know what's going on, we're going to send this guy info about baby products." Everyone is doing predictive modeling.
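
A toy version of the beer-and-diapers example: assign weights to purchases that hint at a life event and act once the combined score crosses a threshold. Real retail models are statistical and far more elaborate; the item weights, the threshold and the function names here are all invented for illustration.

```python
# Hypothetical weights: how strongly each purchase hints at a new baby.
BABY_SIGNALS = {"pregnancy test": 2, "diapers": 3}
THRESHOLD = 4  # assumed cutoff for sending baby-product offers


def baby_score(purchases):
    """Sum the weights of every signal item that appears in the purchase history."""
    seen = set(purchases)
    return sum(weight for item, weight in BABY_SIGNALS.items() if item in seen)


def should_send_baby_offers(purchases):
    return baby_score(purchases) >= THRESHOLD


usual_basket = ["beer", "chips", "beer", "chips"]
new_basket = usual_basket + ["pregnancy test", "diapers"]

print(should_send_baby_offers(usual_basket))  # False
print(should_send_baby_offers(new_basket))    # True
```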

On that note, do you see a privacy issue here?

There's a huge privacy issue. There's this great video from the ACLU about ordering a pizza in the future that sums it up. There aren't really any data privacy laws in place. In terms of what a website or retailer can track about you, I'm not aware of any laws on that subject.

We as individuals generally value that information at zero. In studies, researchers have asked people: "For this particular website or service, we'll give you two choices: you can either pay for it, or you get it free, but it comes with ads." Everyone took it for free with ads. What they don't realize, or what they know and don't care about, is that those ads come with tracking cookies. They're collecting these massive data sets on us, and we as Americans just don't mind. Both culturally and legally, we don't seem to care. I think that's very unfortunate.

Is there anything unquestionably positive we can do with data collection beyond counterterrorism and advertising?

Big data is a tool, and like any tool, it can be used for good or bad. The Internet can be used to disseminate information on an unimaginable scale, and it can be used for child porn. So it’s really in the hands of the people who use it.

We can build models like we never have before. In crime, New York is famous for the Compstat system, where police look at what crimes occurred and when and where they happened. They allocate the police force based on that. A more efficient police force is wonderful for society.
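
The allocation idea can be sketched as a simple proportional split: count incidents per area, then assign officers in proportion to those counts. This is an illustrative simplification, not how Compstat itself works; the precinct names, incident list and function name are hypothetical.

```python
from collections import Counter


def allocate_officers(incident_precincts, total_officers):
    """Split officers across precincts in proportion to recorded incidents."""
    counts = Counter(incident_precincts)
    total_incidents = sum(counts.values())
    if total_incidents == 0:
        return {}
    # Exact (fractional) share for each precinct, then round down.
    exact = {p: total_officers * c / total_incidents for p, c in counts.items()}
    allocation = {p: int(share) for p, share in exact.items()}
    # Hand any officers lost to rounding to the precincts with the largest remainders.
    leftover = total_officers - sum(allocation.values())
    by_remainder = sorted(exact, key=lambda p: exact[p] - allocation[p], reverse=True)
    for precinct in by_remainder[:leftover]:
        allocation[precinct] += 1
    return allocation


# Hypothetical incident log: one entry per reported crime, labeled by precinct.
incidents = ["north", "north", "north", "south", "east", "east"]

print(allocate_officers(incidents, 12))  # {'north': 6, 'south': 2, 'east': 4}
```

A fuller model would also weigh the time of day and the type of crime, which the interview mentions but this sketch leaves out.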

Again, these models can be used for good or bad. Those police could be used to stop the criminals, or in the extreme police state, those officers could be used to crack down on dissenters.

In the end, companies are collecting an obscene amount of data on us, and I think that’s just as much a threat to individuals as government data collection might be.