“Sorry, I didn't hear you.”

This may be the first empathetic utterance by a commercial machine. In the late 1990s the Boston company SpeechWorks International began supplying companies with customer-service software programmed to use this phrase and others. In the years since, we have become accustomed to talking to machines. Nearly every call to a customer-service line begins with a conversation with a robot. Hundreds of millions of people carry an intelligent personal assistant around in their pocket. We can ask Siri and other such assistants to find restaurants, call our friends or look for a song to play. They are capable of simulating eerily human behavior. (Human: “Siri, do you love me?” Siri: “I am not capable of love.”)

But machines do not always respond the way we would like them to. Speech-recognition software makes mistakes. Machines often fail to understand intention. They do not get emotion and humor, sarcasm and irony. If in the future we are going to spend more time interacting with machines—and we are, whether they are intelligent vacuum cleaners or robotic humanoid nurses—we need them to do more than understand the words we are saying: we need them to get us. We need them, in other words, to “understand” and share human emotion—to possess empathy.

In my laboratory at the Hong Kong University of Science and Technology, we are developing such machines. Empathetic robots can be of great help to society. They will not be mere assistants—they will be companions. They will be friendly and warm, anticipating our physical and emotional needs. They will learn from their interactions with humans. They will make our lives better and our jobs more efficient. They will apologize for their mistakes and ask for permission before proceeding. They will take care of the elderly and teach our children. They might even save your life in critical situations while sacrificing themselves in the process—an act of ultimate empathy.

Some robots that mimic emotion are already on the market—including Pepper, a small humanoid companion built by the French firm Aldebaran Robotics for the Japanese company Softbank Mobile, and Jibo, a six-pound desktop personal-assistant robot designed by a group of engineers that included Roberto Pieraccini, former director of dialog technologies at SpeechWorks. The field of empathetic robotics is still in its steam-engine days, but the tools and algorithms that will dramatically improve these machines are emerging.

The empathy module

I became interested in building empathetic robots six years ago, when my research group designed the first Chinese equivalent of Siri. I found it fascinating how naturally users developed emotional reactions to personal-assistant systems—and how frustrated they became when their machines failed to understand what they were trying to communicate. I realized that the key to building machines that could understand human emotion was speech-recognition algorithms like those I had spent my 25-year career developing.

Any intelligent machine is, at its core, a software system consisting of modules, each one a program that performs a single task. An intelligent robot could have one module for processing human speech, one for recognizing objects in images captured by its video camera, and so on. An empathetic robot has a heart, and that heart is a piece of software called the empathy module. An empathy module analyzes facial cues, acoustic markers in speech and the content of speech itself to read human emotion and tell the robot how to respond.
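In software terms, the fusion an empathy module performs can be sketched very simply. The sketch below is an illustration only, not our lab's actual module: the weights, emotion labels and keyword lexicon are invented, and a tiny lookup table stands in for a real semantic decoder.

```python
def empathy_module(face_cues, acoustic_cues, text):
    """Fuse three signal streams into one emotion estimate (toy weighting)."""
    scores = {}
    # Facial and acoustic analyzers each report emotion strengths in [0, 1];
    # the 0.4 weights are arbitrary illustration values.
    for source, weight in ((face_cues, 0.4), (acoustic_cues, 0.4)):
        for emotion, strength in source.items():
            scores[emotion] = scores.get(emotion, 0.0) + weight * strength
    # Content analysis: a hypothetical keyword lexicon stands in for a real
    # semantic decoder.
    lexicon = {"sorrow": "sad", "fear": "afraid", "great": "happy"}
    for word in text.lower().split():
        if word in lexicon:
            emotion = lexicon[word]
            scores[emotion] = scores.get(emotion, 0.0) + 0.2
    return max(scores, key=scores.get)
```

Given a sad face, a sad-sounding voice and the word "sorrow" in the transcript, all three streams agree, and the module reports "sad"; a real system would learn the weights from data rather than fix them by hand.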

When two people communicate with each other, they automatically use a variety of cues to understand the other person's emotional state—they interpret facial gestures and body language; they perceive changes in tone of voice; they understand the content of speech. Building an empathy module is a matter of identifying those characteristics of human communication that machines can use to recognize emotion and then training algorithms to spot them.

When my research group set out to train machines to detect emotion in speech, we decided to teach machines to recognize fundamental acoustic features of speech in addition to the meaning of the words themselves because this is how humans do it. We rarely think of it in these terms, but human communication is signal processing. Our brain detects emotion in a person's voice by paying attention to acoustic cues that signal stress, joy, fear, anger, disgust, and so on. When we are cheerful, we talk faster, and the pitch of our voice rises. When we are stressed, our voices become flat and “dry.” Using signal-processing techniques, computers can detect these cues, just as a polygraph picks up blood pressure, pulse and skin conductivity. We used supervised learning to train machine-learning algorithms to recognize the sonic cues that correlate with stress.
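The recipe above—extract acoustic cues, then learn from labeled examples—can be shown in miniature. This is a toy sketch, not our production system: the "voices" are pure sine tones, the two features (short-time energy and zero-crossing rate, a crude stand-in for pitch) are far simpler than real acoustic front ends, and the classifier is a bare-bones nearest-centroid learner.

```python
import math

def tone(freq_hz, amplitude, n=1600, rate=8000):
    """Synthetic stand-in for a recorded voice: a pure sine wave."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / rate)
            for i in range(n)]

def features(signal, frame=160):
    """Two toy acoustic cues: average short-time energy (loudness)
    and zero-crossing rate (a crude proxy for pitch)."""
    energies = [sum(s * s for s in signal[i:i + frame]) / frame
                for i in range(0, len(signal) - frame, frame)]
    energy = sum(energies) / len(energies)
    zcr = sum(1 for a, b in zip(signal, signal[1:]) if a * b < 0) / len(signal)
    return (energy, zcr)

def train(labeled):
    """Supervised learning in miniature: a nearest-centroid classifier."""
    return {label: tuple(sum(v[i] for v in vecs) / len(vecs) for i in range(2))
            for label, vecs in labeled.items()}

def classify(centroids, vec):
    return min(centroids, key=lambda label: math.dist(centroids[label], vec))
```

Here a "cheerful" voice is simulated as a louder, higher-pitched tone and a "stressed" one as a flatter, quieter tone, echoing the cues described above; real training data would be labeled recordings of actual speech.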

A brief recording of human speech might contain only a few words, but we can extract vast amounts of signal-processing data from the tone of voice. We started by teaching machines to recognize negative stress (distress) in speech samples from students at my institution, which students have nicknamed “Hong Kong University of Stress and Tension.” We built the first-ever multilingual corpus of natural stress emotion in English, Mandarin and Cantonese by asking students 12 increasingly stressful questions. By the time we had collected about 10 hours of data, our algorithms could recognize stress accurately 70 percent of the time—a rate remarkably similar to that of human listeners.

While we were doing this work, another team within my group was training machines to recognize mood in music by analyzing sonic features alone (that is, without paying attention to lyrics). Mood, unlike emotion, is an ambiance that persists for the duration of a piece of music. This team started by collecting 5,000 pieces of music from all genres in major European and Asian languages. A few hundred of those pieces had already been classified into 14 mood categories by musicologists.

We electronically extracted some 1,000 fundamental signal attributes from each song—acoustic parameters such as energy, fundamental frequency, harmonics, and so on—and then used the labeled music to train 14 different software “classifiers,” each one responsible for determining whether a piece of music belongs to a specific mood. For example, one classifier listens only for happy music, and another only for melancholy music. The 14 classifiers work together, building on one another's guesses. If a “happy” classifier mistakenly finds a melancholic song to be happy, then in the next round of relearning, this classifier will be retrained. At each round, the weakest classifier is retrained, and the overall system is boosted. In this manner, the machine listens to many pieces of music and learns which pieces belong to which moods. In time, it is able to tell the mood of any piece of music just by listening to the audio, as most of us do.

Based on this research, former students and I started a company called Ivo Technologies to build empathetic machines for people to use at home. The first product, Moodbox, will be a smart home infotainment center that controls the music and lighting in each room and responds to user emotions.
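The retrain-the-weakest loop described above can be sketched in a few dozen lines. This is a drastically simplified illustration, not our actual system: each "classifier" here is a one-feature threshold rule (a decision stump) over made-up acoustic features, there are two moods rather than 14, and the boosting step simply doubles the weight of the weakest classifier's mistakes before refitting it.

```python
def train_stump(X, y, weights=None):
    """Fit a one-feature threshold classifier, minimising weighted error."""
    if weights is None:
        weights = [1.0] * len(X)
    best = None
    for f in range(len(X[0])):                  # try every feature...
        for t in sorted({x[f] for x in X}):     # ...every threshold...
            for pol in (1, -1):                 # ...and both polarities.
                err = sum(w for x, label, w in zip(X, y, weights)
                          if (pol * (x[f] - t) > 0) != label)
                if best is None or err < best[0]:
                    best = (err, f, t, pol)
    _, f, t, pol = best
    return lambda x: pol * (x[f] - t) > 0

def boost_weakest(X, labels_by_mood, rounds=5):
    """One binary classifier per mood; each round, retrain only the weakest."""
    stumps = {m: train_stump(X, y) for m, y in labels_by_mood.items()}
    weights = {m: [1.0] * len(X) for m in labels_by_mood}
    for _ in range(rounds):
        errors = {m: sum(w for x, label, w in zip(X, y, weights[m])
                         if stumps[m](x) != label)
                  for m, y in labels_by_mood.items()}
        weakest = max(errors, key=errors.get)
        y = labels_by_mood[weakest]
        # Upweight the weakest classifier's mistakes, then retrain it alone.
        weights[weakest] = [w * (2.0 if stumps[weakest](x) != label else 1.0)
                            for x, label, w in zip(X, y, weights[weakest])]
        stumps[weakest] = train_stump(X, y, weights[weakest])
    return stumps
```

With toy feature vectors such as (energy, tempo), songs with high values land in the "happy" classifier's positive region and low-valued ones in "melancholy"; the real system works the same way but over roughly 1,000 extracted attributes and 14 moods.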

Understanding intent

To understand humor, sarcasm, irony and other high-level communication attributes, a machine will need to do more than recognize emotion from acoustic features. It will also need to understand the underlying meaning of speech and compare the content with the emotion with which it was delivered.

Researchers have been developing advanced speech recognition using data gathered from humans since the 1980s, and today the technology is quite mature. But there is a vast difference between transcribing speech and understanding it.

Think of the chain of cognitive, neurological and muscular events that occurs when one person speaks to another: one person formulates her thoughts, chooses her words and speaks, and then the listener decodes the message. The speech chain between a human and a machine goes like this: speech waves are converted into digital form and then into parameters. Speech-recognition software turns these parameters into words, and a semantic decoder transforms words into meaning.

When we began our research on empathetic robots, we realized that algorithms similar to those that extract user sentiment from online comments could help us analyze emotion in speech. These machine-learning algorithms look for telltale cues in the content. Key words such as “sorrow” and “fear” suggest loneliness. Repeated use of telltale colloquial words (“c'mon,” for example) can reveal that a song is energetic. We also analyze information about the style of speech. Are a person's answers certain and clear or hesitant, peppered with pauses and hedging words? Are the responses elaborate and detailed or short and curt?

In our research on mood recognition in music, we have trained algorithms to mine lyrics for emotional cues. Instead of extracting audio signatures of each piece of music, we pulled strings of words from the song's lyrics and fed them to individual classifiers, each one responsible for determining whether this string of words conveys any of the 14 moods. Such strings of words are called n-grams. In addition to word strings, we also used part-of-speech tags of these words as part of the lyrics “signature” for mood classification. Computers can use n-grams and part-of-speech tags to form statistical approximations of grammatical rules in any language; these rules help programs such as Siri recognize speech and software such as Google Translate convert text into another language.
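Extracting n-grams and part-of-speech tags is mechanically simple, and a small sketch makes the idea concrete. The miniature POS lexicon below is invented for illustration—real systems use statistical taggers trained on annotated text—but the n-gram extraction itself is the standard technique.

```python
def ngrams(tokens, n):
    """All contiguous strings of n tokens, the n-grams described above."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# A hypothetical miniature POS lexicon; real taggers are learned from data.
POS = {"i": "PRON", "walk": "VERB", "alone": "ADV", "tonight": "NOUN",
       "we": "PRON", "dance": "VERB", "all": "DET", "night": "NOUN"}

def lyric_signature(line):
    """Combine word n-grams with POS-tag n-grams into a lyrics 'signature'."""
    tokens = line.lower().split()
    tags = [POS.get(t, "X") for t in tokens]   # "X" marks unknown words
    return {"unigrams": ngrams(tokens, 1),
            "bigrams": ngrams(tokens, 2),
            "tag_bigrams": ngrams(tags, 2)}
```

A lyric like "I walk alone tonight" yields word bigrams such as ("walk", "alone") and tag bigrams such as ("PRON", "VERB"); mood classifiers then learn which of these features co-occur with which of the 14 moods.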

Once a machine can understand the content of speech, it can compare that content with the way it is delivered. If a person sighs and says, “I'm so glad I have to work all weekend,” an algorithm can detect the mismatch between the emotion cues and the content of the statement and calculate the probability that the speaker is being sarcastic. Similarly, a machine that can understand emotion and speech content can pair that information with other inputs to detect more complex intentions. If someone says, “I'm hungry,” a robot can determine the best response based on its location, time of day and the historical preferences of its user, along with other parameters. If the robot and its user are at home and it is almost lunchtime, the robot might know to respond: “Would you like me to make you a sandwich?” If the robot and its user are traveling, the machine might respond: “Would you like me to look for restaurants?”
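Both ideas—sarcasm as a mismatch between words and delivery, and intent resolved from context—can be sketched with simple rules. The word lists, the mismatch probabilities and the lunch-hour window below are all invented illustration values; a deployed system would learn them from data rather than hard-code them.

```python
# Hypothetical sentiment word lists, standing in for a learned lexicon.
POSITIVE = {"glad", "great", "love", "happy"}
NEGATIVE = {"sorrow", "fear", "hate", "awful"}

def text_sentiment(utterance):
    words = utterance.lower().replace("'", " ").split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def sarcasm_probability(utterance, voice_emotion):
    """High when the words say one thing and the delivery says another."""
    sentiment = text_sentiment(utterance)
    if sentiment > 0 and voice_emotion in {"sigh", "flat", "stressed"}:
        return 0.9   # positive words, negative delivery: likely sarcasm
    if sentiment < 0 and voice_emotion == "cheerful":
        return 0.7   # negative words, upbeat delivery: possibly ironic
    return 0.1       # no mismatch detected

def respond_to_hunger(location, hour):
    """Pick a response from context, as in the 'I'm hungry' example."""
    if location == "home" and 11 <= hour <= 13:
        return "Would you like me to make you a sandwich?"
    return "Would you like me to look for restaurants?"
```

A sighed “I'm so glad I have to work all weekend” scores positive on words but negative on delivery, so the mismatch rule fires; the same sentence spoken cheerfully would not.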

Zara the supergirl

At the beginning of this year students and postdoctoral researchers in my lab began pulling all our various speech-recognition and emotion-recognition modules together into a prototype empathetic machine we call Zara the Supergirl. It took hundreds of hours of data to train Zara, but today the program runs on a single desktop computer. For the moment, she is a virtual robot, represented on a screen by a cartoon character.

When you begin a conversation with Zara, she says, “Please wait while I analyze your face”; Zara's algorithms study images captured by the computer's webcam to determine your gender and ethnicity. She will then guess which language you speak (Zara understands English and Mandarin and is learning French) and ask you a few questions in your native tongue. What is your earliest memory? Tell me about your mother. How was your last vacation? Tell me a story with a woman, a dog and a tree. Through this process, based on your facial expressions, the acoustic features of your voice and the content of your responses, Zara will reply in ways that mimic empathy. After five minutes of conversation, Zara will try to guess your personality and ask you about your attitudes toward empathetic machines. This is a way for us to gather feedback from people on their interactions with early empathetic robots.

Zara is a prototype, but because she is based on machine-learning algorithms, she will get “smarter” and more empathetic as she interacts with more people and gathers more data. Right now her database of knowledge is based only on interactions with graduate students in my lab. Next year we plan to give Zara a body by installing her in a humanoid robot.

It would be premature to say that the age of friendly robots has arrived. We are only beginning to develop the most basic tools that emotionally intelligent robots would need. And when Zara's descendants begin arriving on the market, we should not expect them to be perfect. In fact, I have come to believe that focusing on making machines perfectly accurate and efficient misses the point. The important thing is that our machines become more human, even if they are flawed. After all, that is how humans work. If we do this right, empathetic machines will not be the robot overlords that some people fear. They will be our caregivers, our teachers and our friends.