Whether it is used to out-bluff world poker champions or schedule hairdresser appointments in a (mostly) convincing human voice, AI and its underlying machine-learning algorithms keep making big strides in their capabilities—and into ever-more intimate spaces of our lives. And, like any technological feat predicated on the collection and analysis of massive data sets, some of these breakthroughs come with significant privacy risks. New data-collecting techniques, however, could enable researchers to better preserve users’ privacy yet still glean valuable insights from their personal information.
Take digital assistants, where the fruits of AI innovation are increasingly manifested. Today, Amazon’s Alexa and Google Assistant distinguish between the voices of different people in your home, and can use these voice signatures to deliver personalized traffic reports and schedule appointments in the relevant speaker’s calendar. Pulling off such tricks requires sophisticated natural-language processing skills. It also requires access to very sensitive data. Location history, contacts, calendars, transcribed records of voice queries, online browsing and purchase histories—all this can go into training the AI that helps virtual assistants become more useful and personalized.
This presents a thorny question for the companies making them, particularly if they claim to take user privacy seriously. How does one create smart virtual assistants that understand individual user preferences, without snooping on users’ activities and putting their personal data at risk? “The first line of defense is anonymization and encryption of the data,” says Oren Etzioni, chief executive officer of the Allen Institute for Artificial Intelligence. “Anonymization so that it's not directly linked to you in an obvious way, and encryption so that an outside party can’t get access to that [information].”
In addition to offering full-disk encryption that encodes all of the information on a particular device to protect it from prying eyes, Apple and Google also rely on a statistical method known as local differential privacy to keep the data they mine from those devices anonymous. “The idea is that when it’s being collected from users on their laptops or smartphones, some amount of carefully calibrated noise is also added to the data,” says Anupam Datta, a Carnegie Mellon University professor of electrical and computer engineering. “That masked, noisy data from lots of users then gets encrypted and sent to Google or Apple servers to be parsed for meaningful results.” The companies might learn, for example, that a certain number of smartphones use a particular app at a given time of day—but the companies would not know the identities of those smartphones or their owners.
Apple says it uses this safeguard to improve the intelligence and usability of features such as QuickType word and emoji suggestions in its operating system. Similarly, Google has used local differential privacy to remotely collect data from its Chrome browser running on user devices. That process helps the company ferret out malware threats to its browser. Still, Datta warns it is a mistake to think of differential privacy as synonymous with total privacy. “It's a relative guarantee, not an absolute one,” he says.
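The noise-adding idea Datta describes can be illustrated with randomized response, a classic textbook mechanism for local differential privacy. The sketch below is not Apple's or Google's implementation; the function names, the single yes/no question ("does this phone use the app?") and the privacy parameter epsilon are all assumptions chosen for illustration. Each device flips its true answer with a calibrated probability before reporting, and the collector corrects for the known flip rate to estimate the overall proportion.

```python
import math
import random


def randomize(truth: bool, epsilon: float, rng: random.Random) -> bool:
    """Report the true bit with probability e^eps / (1 + e^eps); otherwise flip it."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return truth if rng.random() < p_truth else not truth


def estimate_rate(reports: list[bool], epsilon: float) -> float:
    """Undo the known noise to estimate the true fraction of 'yes' answers."""
    p = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    observed = sum(reports) / len(reports)
    # E[observed] = p * rate + (1 - p) * (1 - rate); solve for rate.
    return (observed - (1.0 - p)) / (2.0 * p - 1.0)


def simulate(n_users: int, true_rate: float, epsilon: float, seed: int = 0) -> float:
    """Hypothetical experiment: n_users devices each send one noisy bit."""
    rng = random.Random(seed)
    truths = [rng.random() < true_rate for _ in range(n_users)]
    reports = [randomize(t, epsilon, rng) for t in truths]  # all the server ever sees
    return estimate_rate(reports, epsilon)
```

With many users, `simulate(100_000, 0.3, 1.0)` should land close to 0.3, even though any single report may well be a lie. A smaller epsilon means more noise and stronger deniability for each user, but a less precise aggregate, which is one way to read Datta's point that the guarantee is relative rather than absolute.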
In an effort to keep sensitive user data off remote servers entirely, Google is experimenting with an approach called federated learning. Instead of collecting and sending data to train its machine-learning models, the company sends the models themselves directly to the user. You download the current model to your smartphone; the model changes based on what it learns from your personal data; and the updated model then goes back to the cloud, where it is averaged with all the other updated models. At no point does Google see or collect your personal data.
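The download, local update and averaging loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not Google's system: the model is a one-variable linear fit, the three "phones" and their data are invented, and the names `local_update`, `federated_average` and `train` are hypothetical. The point is only that the server handles model weights, never the raw data.

```python
Weights = tuple[float, float]          # (slope, intercept) of a tiny linear model
Dataset = list[tuple[float, float]]    # private (x, y) pairs held on one device


def local_update(weights: Weights, data: Dataset,
                 lr: float = 0.1, epochs: int = 5) -> Weights:
    """Runs on the device: a few gradient-descent steps on its own data."""
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(((w * x + b) - y) * x for x, y in data) / n
        grad_b = sum(((w * x + b) - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)


def federated_average(updates: list[tuple[Weights, int]]) -> Weights:
    """Runs on the server: average the returned models, weighted by data size."""
    total = sum(n for _, n in updates)
    w = sum(wts[0] * n for wts, n in updates) / total
    b = sum(wts[1] * n for wts, n in updates) / total
    return (w, b)


def train(clients: list[Dataset], rounds: int = 200) -> Weights:
    global_weights: Weights = (0.0, 0.0)
    for _ in range(rounds):
        # Each client receives the current model and returns only new weights.
        updates = [(local_update(global_weights, data), len(data))
                   for data in clients]
        global_weights = federated_average(updates)
    return global_weights


# Three hypothetical phones, each holding points from the line y = 2x + 1.
clients = [
    [(0.0, 1.0), (0.5, 2.0), (1.0, 3.0)],
    [(1.5, 4.0), (2.0, 5.0)],
    [(-1.0, -1.0), (-0.5, 0.0), (0.25, 1.5)],
]
slope, intercept = train(clients)
```

After training, `slope` and `intercept` should be close to 2 and 1. Note that the returned weights themselves can still carry information about the data they were trained on, which is the concern the security researchers raise next.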
Despite such measures, some security researchers still have reservations. They point out, for example, that hackers can easily reverse engineer user data if they get access to the machine-learning models. Furthermore, strategies like differential privacy are really only useful for learning generalized trends about a large group or population, says University of Southern California professor and former Google research scientist Aleksandra Korolova. These strategies, she says, do not reveal key insights at the level of the individual, which is presumably what digital assistants would most need to do.
The biggest problem is that all these techniques effectively amount to building software walls around data, according to University of California, Berkeley, professor and security researcher Raluca Ada Popa. “Attackers always end up breaking into software, so these walls will never be a foolproof mechanism,” she says.
Popa and some of her colleagues at Berkeley's RISELab think they have a better solution for gleaning insights from highly personal data. Secure multiparty computation, they say, would enable tech companies to gather the information they want from multiple encrypted data sources without those sources having to reveal their private data. In effect, researchers can study information contained in large encrypted data sets without needing to see the raw data in those sets. Whether the goal is finding better cancer treatment predictors or serving personalized ads and restaurant recommendations, companies would not have to see the underlying personal information, and could not even if they wanted to. “I really think this is the future, because you don’t have to give up your data anymore to do all this useful AI and machine learning,” Popa says.
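One of the simplest building blocks of secure multiparty computation is additive secret sharing, sketched below. It is not a description of RISELab's system, and everything named here (the `share`, `reconstruct` and `secure_sum` functions, the choice of modulus, the example values) is an assumption for illustration. Each party splits its private number into random-looking shares that only add up to the real value when combined, so the group can compute a total without anyone seeing an individual input.

```python
import secrets

PRIME = 2**61 - 1  # all arithmetic is done modulo this (assumed) prime


def share(secret: int, n_parties: int) -> list[int]:
    """Split a non-negative secret into n shares that sum to it modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    last = (secret - sum(shares)) % PRIME
    return shares + [last]


def reconstruct(shares: list[int]) -> int:
    """Recombine shares; any proper subset alone looks uniformly random."""
    return sum(shares) % PRIME


def secure_sum(private_values: list[int]) -> int:
    """Simulate n parties computing the sum of their inputs.

    Party i sends its j-th share to party j. Each party adds up the shares
    it received and publishes only that partial sum.
    """
    n = len(private_values)
    distributed = [share(v, n) for v in private_values]
    partial_sums = [
        sum(distributed[i][j] for i in range(n)) % PRIME
        for j in range(n)
    ]
    return reconstruct(partial_sums)


total = secure_sum([52, 61, 47])  # three hypothetical private readings
print(total)  # 160
```

The sketch assumes the inputs are non-negative integers whose sum is smaller than `PRIME`, and that the parties follow the protocol honestly. Real deployments also need protections against cheating or colluding participants, and far more elaborate protocols to go beyond simple sums.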
Amazon, Apple, Google, Microsoft and other big tech companies may have different business models and data-collection motivations, yet all have staked their futures on increasing the intelligence of their devices and services. That means their data-collection efforts will likely only increase—and as many people have discovered in recent months, the amount of personal information these companies already have is often staggering. Someday it may be possible to train highly accurate, highly personalized AI using private data without compromising user privacy. But for now, there is only one way to ensure your sensitive information does not end up in the wrong hands: Don’t share it.