Playing It by Ear

A machine-listening system that understands three speakers at once

Prince Shotoku was a seventh-century politician credited with authoring Japan’s first constitution. Famed as a nation builder, he is said to have been able to listen to many people simultaneously, hearing the petitions of up to 10 supplicants at once and then handing down judgments or advice.

Inspired by the legendary prince, Japanese researchers have spent five years developing a humanoid robot system that can understand and respond to simultaneous speakers. They posit a restaurant scenario in which the robot is a waiter. When three people stand before the robot and simultaneously order pork cutlet meals or French dinners, the robot understands the orders with about 70 percent accuracy, responding by repeating each order and giving the total price. This process takes less than two seconds and, crucially, requires no prior voice training.

Such auditory powers mark a fundamental challenge in artificial intelligence: how to teach machines to pick out significant sounds amid the hubbub. The human ability to do so is known as the cocktail party effect, and most machines do no better at it than humans who have had a few too many martinis. “It’s very difficult for a robot to recognize speakers in a noisy environment,” says Hiroshi G. Okuno of Kyoto University, the team leader and a pioneer in the field. Reverberations, extraneous sounds and other signal interruptions also pose difficulties.

Indeed, the era of easy natural-language communication with machines, a dream batted around at least since the time of Alan Turing, seems far off for the everyday user. A humorous example: Microsoft’s live demo last year of the Windows Vista speech-recognition feature, which mangled the salutation “Dear Mom” and a verbal attempt to fix an error, producing “Dear Aunt, let’s set so double the killer delete select all.”

In comparison, Okuno’s system is remarkably accurate and does not require speakers to wear a headset (unlike commercial speech-recognition programs), because the microphones are embedded in a robot. His so-called machine-listening program performs what is known as computational auditory scene analysis, which incorporates digital signal processing and statistical methods. It first locates the sources of the audio and then separates the sounds with computational filters. The next step is key: automatic missing-feature mask generation. This powerful technique masks auditory data, such as cross talk, that the system decides are unreliable as it tries to zero in on a particular speaker. The system then compares the processed information with an internal database of 50 million utterances in Japanese to figure out which words were spoken. When the filtered version of each speaker is played back, only a few sounds from the other speakers are audible.
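The masking step can be illustrated with a toy sketch. This is not Okuno's implementation: the function names, the 0 dB threshold, the tiny two-by-two "spectrograms" and the two word templates are all assumptions made up for illustration. The idea shown is only the general one described above: mark time-frequency cells where cross talk dominates as unreliable, then match against stored patterns using the reliable cells alone.

```python
import math

def missing_feature_mask(target, interference, threshold_db=0.0):
    """Binary mask: 1 where the estimated target energy exceeds the
    estimated interference by more than threshold_db, else 0.
    (Hypothetical helper; the threshold is an assumed value.)"""
    mask = []
    for t_row, i_row in zip(target, interference):
        row = []
        for t, i in zip(t_row, i_row):
            # Small constant avoids log of zero for silent cells.
            snr_db = 10.0 * math.log10((t + 1e-12) / (i + 1e-12))
            row.append(1 if snr_db > threshold_db else 0)
        mask.append(row)
    return mask

def masked_distance(features, template, mask):
    """Mean squared difference computed over reliable cells only."""
    total, count = 0.0, 0
    for f_row, t_row, m_row in zip(features, template, mask):
        for f, t, m in zip(f_row, t_row, m_row):
            if m:
                total += (f - t) ** 2
                count += 1
    return total / count if count else float("inf")

def recognize(features, mask, templates):
    """Return the template label closest to the reliable features."""
    return min(templates,
               key=lambda word: masked_distance(features, templates[word], mask))

# Toy data: rows are time frames, columns are frequency bands.
target_estimate = [[4.0, 0.1], [3.0, 0.2]]        # energy after separation
interference_estimate = [[1.0, 2.0], [0.5, 3.0]]  # estimated cross talk
observed = [[5.0, 2.1], [3.5, 3.2]]               # mixture heard by the robot

# Made-up templates standing in for a real utterance database.
templates = {
    "pork": [[5.0, 0.0], [3.5, 0.0]],
    "french": [[1.0, 2.0], [1.0, 3.0]],
}

mask = missing_feature_mask(target_estimate, interference_estimate)
best = recognize(observed, mask, templates)
print(mask, best)
```

In this toy case the second frequency band is dominated by the other speaker, so it is masked out and only the first band contributes to the match. A real recognizer would use statistical acoustic models rather than a squared-distance comparison.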

The result is a robust robot listener that comes closer to the human brain’s auditory powers than other systems do. Okuno says it might handle as many as six talkers, depending on their relative angles and the number of microphones used (currently eight). The robot can move and orient toward speakers, too, thereby enhancing performance.

“Okuno’s project for robots that can understand overlapping voices does a really nice job of combining the best ideas in multimicrophone source localization with the powerful technique of missing-feature speech recognition,” remarks Dan Ellis, head of Columbia University’s Laboratory for the Recognition and Organization of Speech and Audio. “What makes his work stand out from anything else is the commitment to solving all the practical problems that come up in a real-world deployment ... and making something that ... can enable a robot to understand its human interlocutors in real-world situations.”

Besides serving up fast food, Okuno’s robot could lead to a hearing prosthesis that is just as good at reducing noise interference. Okuno thinks such a device could be combined with a sophisticated automatic paraphrasing system, which he considers even more important because hearing-impaired people rely heavily on context in conversation. Okuno himself is nearly deaf without hearing aids after years of listening to loud music through headphones. “The current hearing capabilities of humanoid robots are similar to mine,” he chuckles.

Okuno expects much wider applications. “In the near future, many appliances will have microphones embedded in them,” he predicts—and will do a lot more than ask if you want fries with that.
