As useful as it would be to interact with smartphones and other gadgets by chatting casually with them, the technology to enable such a simple but meaningful back-and-forth has proved elusive. Voice-controlled virtual assistants such as Amazon Alexa, Apple Siri and Google Assistant typically require users to make formal, carefully articulated requests while isolating themselves someplace with minimal background noise. The existing technology also suffers from its inability to go off-script, as its side of the conversation relies on a handful of preprogrammed responses.
The companies developing these voice assistants are painfully aware of their shortcomings. Apple appears to be ramping up hiring of Siri engineers to improve its offering, whereas Google and Amazon have been busy expanding their voice assistants’ ability to perform multiple tasks, called “routines,” at a single command.
Amazon on Thursday introduced three more Alexa improvements that will be available by the end of May. One of the most significant is called “context carryover,” which will enable Alexa to recall information from one voice request to another. A new memory feature will allow users to store and retrieve birthdays, anniversaries and other important information via voice commands. Amazon has also improved Alexa’s ability to search for and execute new “skills,” the voice-interface equivalent of smartphone apps. Asking Alexa how to remove an oil stain from a shirt, for example, will activate the “Tide Stain Remover” skill, which talks the user through a stain-removal process. Other skills enable Alexa users to check their Capital One credit card balances, obtain opening-bell stock prices or match a wine to their meals with just a few words.
Scientific American spoke with Ruhi Sarikaya, director of applied science at Amazon’s Alexa Machine Learning, who was set to deliver these announcements on Thursday during a keynote talk at an AI conference in Lyon, France. Sarikaya was also set to discuss how improvements in speech recognition and natural-language processing are helping streamline Alexa so the technology can better interpret what users want. Scientific American asked him why voice interfaces are so difficult to get right, when we can expect them to improve and how users can better protect the privacy of the personal data that Alexa collects.
[An edited transcript of the interview follows.]
What makes you say we are on the cusp of voice becoming the primary way we communicate with our devices?
Think about 1976, when [Apple co-founder] Steve Wozniak built the first PC with a monitor and a keyboard. Fast-forward to today, and people are still using a monitor and keyboard to interact with most of their devices. Even with smartphones you either type on or touch a screen to get output. This is a problem because it actually immobilizes us. Even though you might be walking around, your attention is still focused on a screen. That’s changing with voice—for three reasons: increases in computing power in smaller devices; the ability to collect and analyze large amounts of data; and advances in machine learning, in particular deep learning. Those types of AI algorithms are making speech recognition and natural-language understanding more accurate.
What have been the biggest challenges to making voice interfaces that work well with consumer tech?
There are component-level challenges and user-experience challenges with regard to speech recognition. But if the conditions are relatively quiet, it’s very accurate. If there’s background noise or multiple people are speaking, however, that’s a challenge that we still need to deal with. You want to be able to track different voices when multiple people are speaking at the same time. With regard to helping devices understand natural language, context is the critical challenge. If a digital personal assistant is limited to just a few domains or functions (it’s dedicated to playing music, for example), it’s easy to understand the user’s intent. Add to that the responsibility of sifting through data about movies, videos and audiobooks, and all of a sudden the command “Play X” becomes ambiguous. It could refer to content in any of those categories.
Why is context so important when interacting with smart devices?
If you and I are chatting right now, I might carry over information from the last time we spoke. We don’t need to repeat everything that we discussed previously in order to have a seamless conversation. That’s natural for people but not the case when speaking with machines, where you currently have to use precise wording to be understood. You would expect that if a machine is smart enough, it would be able to carry over information from an earlier conversation. If I ask, “Alexa, how is the weather in Seattle?” and then I ask, “How about this weekend?”, I expect to hear about the weather in Seattle this weekend without explicitly saying that in the second question. If I ask, “Alexa, what is my schedule for today?” the system responds using information stored in its calendar. If I ask, “How about this weekend?” I expect calendar information, not weather information, for this weekend. There is no right answer to that second question without context—there could be any number of answers. That’s referred to as “session context,” and it allows a machine to answer the question correctly based on the current conversation.
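The carryover Sarikaya describes can be pictured with a small sketch. This is an illustration only, not Amazon's implementation: the `Request` and `Session` names and the three slots (intent, location, time) are assumptions made for the example. Each new request inherits whatever the user left unsaid from the previous turn in the same conversation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    """A parsed voice request; a slot is None if the user did not say it."""
    intent: Optional[str] = None      # e.g. "weather" or "calendar"
    location: Optional[str] = None
    when: Optional[str] = None


class Session:
    """Hypothetical session that remembers the previous request."""

    def __init__(self) -> None:
        self.last: Optional[Request] = None

    def resolve(self, request: Request) -> Request:
        # Fill every slot the user left unsaid from the previous turn.
        if self.last is not None:
            request = Request(
                intent=request.intent or self.last.intent,
                location=request.location or self.last.location,
                when=request.when or self.last.when,
            )
        self.last = request
        return request


# "How is the weather in Seattle?" followed by "How about this weekend?"
weather_session = Session()
weather_session.resolve(Request(intent="weather", location="Seattle", when="today"))
weather_follow_up = weather_session.resolve(Request(when="this weekend"))

# "What is my schedule for today?" followed by the same follow-up question
calendar_session = Session()
calendar_session.resolve(Request(intent="calendar", when="today"))
calendar_follow_up = calendar_session.resolve(Request(when="this weekend"))
```

The same follow-up words resolve to a weather request in the first session and a calendar request in the second, because only the earlier turn differs.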
How does a machine learn context?
You start with the device receiving the voice command. You can’t play video on an Amazon Echo, so that narrows the device’s options when a user asks the device to play a particular title. You also have the device look at a user’s personal preferences, including previous requests and other commands that have been given to the device over time. That’s where machine learning comes into play.
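A rough sketch of that narrowing might look like the following. It is a simplification under stated assumptions: the capability table, the `echo_with_screen` entry and the `rank_interpretations` function are hypothetical, and counting past requests stands in for the machine-learning models the interview mentions.

```python
from collections import Counter

# Hypothetical capability table: which content types each device can play.
# Only the "no video on an Echo" entry comes from the interview.
DEVICE_CAPABILITIES = {
    "echo": {"music", "audiobook"},
    "echo_with_screen": {"music", "audiobook", "video"},
}


def rank_interpretations(title_matches, device, history):
    """Return the possible meanings of "Play X", most likely first.

    title_matches: content types in which a title named X exists
    device: key into DEVICE_CAPABILITIES
    history: content types of the user's earlier requests
    """
    # Step 1: drop interpretations the receiving device cannot play.
    playable = [c for c in title_matches if c in DEVICE_CAPABILITIES[device]]
    # Step 2: prefer the content types this user has asked for most often.
    counts = Counter(history)
    return sorted(playable, key=lambda c: counts[c], reverse=True)


ranked = rank_interpretations(
    title_matches=["video", "music", "audiobook"],
    device="echo",
    history=["audiobook", "music", "audiobook"],
)
```

Here the video interpretation is discarded because the device has no screen, and the audiobook is ranked ahead of the song because this user has requested audiobooks more often.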
How can you improve Alexa’s ability to pick out speech and understand words even when there is significant background noise?
That’s an open problem, although we are making progress. Having worked on developing voice technology in the past, I can say there are a few different approaches. One is focusing on cleaning, or removing, background noise and then performing speech recognition on the data that’s left. When you do that, though, a side effect is that you may remove some of the data related to the speech itself. Another technique is collecting as much of the sound in a particular environment as possible and having the system map, or identify, different sounds—whether they are background noise or speech. The challenge is there are so many different noises that it’s difficult to be able to identify where each of them is coming from, especially when the TV is on.
How does Amazon use the information it collects about Alexa users?
I can only speak to the machine-learning part of Alexa. Machine learning relies on data collected from Alexa users. We don’t use all of that data—we annotate certain types in order to teach Alexa to recognize different acoustic cues, tones (both male and female) and accents. Our customers are diverse, and we want Alexa to be able to recognize different users. We can’t build a technology that will work only for one type of voice.
How does Amazon address privacy concerns that people might have regarding Alexa?
Alexa stores information it has about its users in the cloud rather than on a device itself, such as an Echo or smartphone. Customers have the ability to delete any information that they want Alexa to forget using the Alexa app and the “Manage Your Content and Devices” page on Amazon’s site. You can, for instance, review voice interactions with Alexa and delete specific voice recordings associated with your account by visiting “History” in “Settings” in the Alexa app.