The world produces roughly 2.5 quintillion bytes of digital data per day, adding to a sea of information that includes intimate details about many individuals’ health and habits. To protect privacy, data brokers must anonymize such records before sharing them with researchers and marketers. But a new study finds it is relatively easy to reidentify a person from a supposedly anonymized data set—even when that set is incomplete.
Massive data repositories can reveal trends that teach medical researchers about disease, demonstrate issues such as the effects of income inequality, coach artificial intelligence into humanlike behavior and, of course, aim advertising more efficiently. To shield people who—wittingly or not—contribute personal information to these digital storehouses, most brokers send their data through a process of deidentification. This procedure involves removing obvious markers, including names and social security numbers, and sometimes taking other precautions, such as introducing random “noise” data to the collection or replacing specific details with general ones (for example, swapping a birth date of “March 7, 1990” for “January–April 1990”). The brokers then release or sell a portion of this information.
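To make those steps concrete, here is a minimal sketch of what a deidentification pass might look like. It is an illustration only, not any broker's actual pipeline: the field names (`name`, `ssn`, `birth_date`, `clinic_visits`), the four-month birth window and the noise range are all assumptions chosen to mirror the examples above.

```python
import random
from datetime import date

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# Hypothetical field names for the "obvious markers" to strip.
DIRECT_IDENTIFIERS = {"name", "ssn"}


def generalize_birth_date(born: date) -> str:
    """Replace an exact birth date with the four-month window containing it."""
    start = ((born.month - 1) // 4) * 4  # index 0, 4 or 8
    return f"{MONTHS[start]}-{MONTHS[start + 3]} {born.year}"


def deidentify(record: dict, rng: random.Random, noise: int = 2) -> dict:
    """Drop direct identifiers, coarsen the birth date, perturb a count."""
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    out["birth_date"] = generalize_birth_date(record["birth_date"])
    # Assumed form of "noise": shift a numeric field by a small random amount.
    shift = rng.randint(-noise, noise)
    out["clinic_visits"] = max(0, record["clinic_visits"] + shift)
    return out


record = {
    "name": "Jane Doe",            # made-up example person
    "ssn": "000-00-0000",
    "birth_date": date(1990, 3, 7),
    "zip_code": "10001",
    "clinic_visits": 5,
}
print(deidentify(record, random.Random(0)))
```

In this sketch the birth date "March 7, 1990" comes out as "January-April 1990", while the zip code passes through untouched. Details like that, left in the record, are what can later be combined to single a person out.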
“Data anonymization is basically how, for the past 25 years, we’ve been using data for statistical purposes and research while preserving people’s privacy,” says Yves-Alexandre de Montjoye, an assistant professor of computational privacy at Imperial College London and co-author of the new study, published this week in Nature Communications. Many commonly used anonymization techniques, however, originated in the 1990s, before the Internet’s rapid development made it possible to collect such an enormous amount of detail about things such as an individual’s health, finances, and shopping and browsing habits. This discrepancy has made it relatively easy to connect an anonymous line of data to a specific person: if a private detective is searching for someone in New York City and knows the subject is male, is 30 to 35 years old and has diabetes, the sleuth would not be able to deduce the man’s name—but could likely do so quite easily if he or she also knows the target’s birthday, number of children, zip code, employer and car model.
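The detective example can be illustrated with a simple count. The sketch below is not the statistical model from the study. It is a much cruder empirical check, run on a synthetic population whose attribute names and value ranges are invented for illustration, of how the share of one-of-a-kind records grows as more attributes become known.

```python
import random
from collections import Counter


def unique_fraction(records, attributes):
    """Share of records whose values on `attributes` match no other record."""
    keys = [tuple(r[a] for a in attributes) for r in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(records)


rng = random.Random(42)

# Made-up attributes and the number of distinct values each can take.
DOMAINS = {
    "sex": 2,
    "age_band": 10,
    "birthday": 365,
    "children": 5,
    "zip_code": 200,
    "car_model": 50,
}

# Synthetic population: every attribute is drawn uniformly and independently,
# which real demographic data is not.
population = [
    {name: rng.randrange(size) for name, size in DOMAINS.items()}
    for _ in range(20000)
]

names = list(DOMAINS)
for k in range(1, len(names) + 1):
    known = names[:k]
    print(f"{k} attribute(s) known: {unique_fraction(population, known):.3f} unique")
```

Adding an attribute can only split groups of look-alike records further, never merge them, so the unique share never goes down as the list of known attributes grows.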
In the past several years, Montjoye and other researchers have published studies that reidentified individuals from sets such as anonymized shopping data or health records. Some contend that the risk of reidentification is relatively low because these sets often reflect only a fraction of the population, which creates uncertainty about whether any particular person is included in the list. But the new study developed a statistical model to calculate the likelihood that any given entry of nameless data can be connected to the true identity of the person it describes. The research found that doing so is disturbingly easy, even when one is working with an incomplete data set.
“In the U.S., on average, if you have 15 characteristics (including age, gender or marital status), that is enough to reidentify Americans in any anonymized data set 99.98 percent of the time,” Montjoye says. Although 15 pieces of demographic information may sound like a lot, they represent a drop in the bucket in terms of what is really out there: in 2017 a marketing analytics company landed in hot water for accidentally publishing an anonymized data set that contained 248 attributes for each of 123 million American households.
How much of a risk does this pose to your personal data? For the new study, the research team created a digital tool that allows individual Internet users to see how likely they are to be reidentified from an anonymous info dump. According to this tool, its average user has an 83 percent risk of reidentification. And one has little recourse when it comes to opting out of information collection. “A paranoid consumer could stop posting anything online at all, stop using the Internet, not use any apps, abandon cell phone use, not use credit cards, but it’s really not practical to do that in this day and age,” says Jennifer Cutler, an associate professor of marketing at the Kellogg School of Management at Northwestern University, who was not involved in the new study. “Our lives today are largely online, and there are always trade-offs to be made. There’s a reason why policy makers haven’t completely clamped down and restricted any data sharing at all. And it’s because data sharing and these models can be used for great good.”
Instead of outlawing data collection altogether, Montjoye suggests data brokers need to develop new anonymization techniques and test them rigorously to make sure a third party cannot identify individuals based on personal statistics. “The issue is mostly with current practices when it comes to anonymization,” he says. “At the moment, we only see the tip of the iceberg, but it’s worrisome that it’s not achieving its goal of preventing reidentification. The standards need to be higher, and the practices need to be reviewed.”
Because individuals have such scant recourse, some believe holding data brokers to a higher standard may require new legislation. “Since it’s anonymous, data collectors don’t have to ask data subjects for their consent, so you don’t know whether your data is being collected and shared with third parties,” says study co-author Luc Rocher, a Ph.D. candidate at Catholic University of Louvain in Belgium. “I think, here, it’s more a question of the responsibility of regulations to better protect our personal data.”
Cutler agrees that research-backed legislation will be necessary. “Interdisciplinary researchers and policy makers really need to continue to do work, like what was done in this paper,” to create evidence-based regulations, she says, “so that we can manage the healthiest balance of innovation and progress while still protecting users as much as we can.”