After great promise in the 1960s that machines would soon think like humans, progress stalled for decades. Only in the past 10 years or so has research picked up, and now there are several popular products on the market that do a decent job of at least recognizing speech. For Björn Schuller, full professor and head of the chair of Complex and Intelligent Systems at the University of Passau, Germany, who grew up watching Knight Rider (a television show about a car that could talk), this is the fulfillment of a childhood fantasy. Schuller is a World Economic Forum Young Scientist who will speak at the World Economic Forum’s Annual Meeting of the New Champions in Tianjin, China, from June 26 to 28. He recently spoke about the possibility of machines soon tuning in to human language quirks, behavior and emotion.
[An edited transcript of the interview follows.]
How did you get interested in machine intelligence and speech recognition?
I was watching Knight Rider, the television series from the ’80s, as a child, and I was very much attached to the idea that machines should be talking with humans to the level at which they can understand emotion.
How does the kind of voice recognition software used in Siri, Cortana, Echo and other products work?
There are two parts. One part deals with speech recognition and synthesis, which is traditionally rooted more in signal processing. The other part deals with natural language processing, which is based more on textual information and interpretation. From the acoustics of the voice, the system maps the spoken signal to words or even to the meaning of words. So, for example, Cortana and Amazon Echo combine these two things and they are essentially spoken dialogue systems. They turn an acoustic signal into a textual representation, try to understand from the words what’s going on, and then produce a chain of words to say something meaningful.
What are the limitations of these technologies?
While their current state is already impressive, systems like Cortana, Siri and Amazon Echo, in my opinion, are very much lacking in terms of going beyond the spoken word. One of my major areas of expertise is paralinguistics. This is anything in the voice or words that gives us information about the speaker’s state and traits, such as emotion, the personality of the speaker, the age of the speaker, gender of the speaker, even the height of the speaker. When we talk, we are not just listening to each other’s intention, but at the same time maybe you’re listening to what age I am, or what my accent is.
Are you optimistic about further breakthroughs?
In machine learning and artificial intelligence, we’ve always seen a sort of pattern. Every now and then there is a new push forward in the field, a new success and new breakthrough, which is significant. Then maybe the expectations it raised are disappointed to some degree. Maybe every 10 years there is a new big push forward.
I am very excited at the moment about all that is happening, because for me, 17 years after I really started to do research on this, it is a really exciting moment to see how spoken dialogue systems have found their way into use. We will very soon see systems gain emotional and social intelligence. Are we tired? Do we have a cold? Are we eating at the moment? These kinds of things are really giving us all sorts of insight into machine comprehension, behavior and social behavior. This might even be a game changer for society.
This interview was produced in conjunction with the World Economic Forum.