See Inside August 2007

Playing It by Ear

A machine-listening system that understands three speakers at once

Prince Shotoku was a seventh-century politician credited with authoring Japan’s first constitution. Famed as a nation builder, he is said to have been able to listen to many people simultaneously, hearing the petitions of up to 10 supplicants at once and then handing down judgments or advice.

Inspired by the legendary prince, Japanese researchers have spent five years developing a humanoid robot system that can understand and respond to simultaneous speakers. They posit a restaurant scenario in which the robot is a waiter. When three people stand before the robot and simultaneously order pork cutlet meals or French dinners, the robot comprehends about 70 percent of what is said, responding by repeating each order and giving the total price. This process takes less than two seconds and, crucially, requires no prior voice training.

Such auditory powers mark a fundamental challenge in artificial intelligence—how to teach machines to pick out significant sounds amid the hubbub. This is known as the cocktail party effect, and most machines do not do any better than humans who have had a few too many martinis. “It’s very difficult for a robot to recognize speakers in a noisy environment,” says Hiroshi G. Okuno of Kyoto University, the team leader and a pioneer in the field. Reverberations, extraneous sounds and other signal interruptions also pose difficulties.

Indeed, the era of easy natural-language communication with machines, a dream batted around at least since the time of Alan Turing, seems far off for the everyday user. A humorous example: Microsoft’s live demo last year of the Windows Vista speech-recognition feature, which mangled the salutation “Dear Mom” and a verbal attempt to fix an error, producing “Dear Aunt, let’s set so double the killer delete select all.”

In comparison, Okuno’s system is remarkably accurate and does not require speakers to wear a headset (unlike commercial speech-recognition programs), because the microphones are embedded in a robot. His so-called machine-listening program performs what is known as computational auditory scene analysis, which incorporates digital signal processing and statistical methods. It first locates the sources of the audio and then separates the sounds with computational filters. The next step is key: automatic missing-feature mask generation. This powerful technique masks auditory data, such as cross talk, that the system decides are unreliable as it tries to zero in on a particular speaker. The system then compares the processed information with an internal database of 50 million utterances in Japanese to figure out which words were spoken. When the filtered version of each speaker is played back, only a few sounds from the other speakers are audible.
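The masking step can be illustrated with a toy sketch. This is not Okuno's implementation, which relies on statistical speech recognition over real audio; it only assumes that each utterance is reduced to a few per-band energy values in decibels and that the separation stage supplies an estimate of the cross talk leaking into each band. The function names, the word templates and the numbers are all hypothetical.

```python
import math

def missing_feature_mask(observed, leakage, margin_db=0.0):
    """Mark a band reliable (1) only if it exceeds the estimated cross talk."""
    return [1 if o - l > margin_db else 0 for o, l in zip(observed, leakage)]

def masked_distance(observed, template, mask):
    """Mean squared difference computed over reliable bands only."""
    diffs = [(o - t) ** 2 for o, t, m in zip(observed, template, mask) if m]
    if not diffs:
        return math.inf  # nothing reliable to compare against
    return sum(diffs) / len(diffs)

def recognize(observed, templates, mask):
    """Return the template word closest to the observation under the mask."""
    return min(templates, key=lambda w: masked_distance(observed, templates[w], mask))

# Hypothetical per-band energies (dB) for two words in the vocabulary.
templates = {"cutlet": [10, 2, 8, 1], "dinner": [1, 9, 2, 10]}

# The target speaker says "cutlet", but bands 1 and 3 are swamped by
# another speaker's voice; leakage is the assumed cross-talk estimate.
observed = [10, 9, 8, 10]
leakage = [0, 9, 1, 10]

mask = missing_feature_mask(observed, leakage)
with_mask = recognize(observed, templates, mask)
without_mask = recognize(observed, templates, [1, 1, 1, 1])
print(mask, with_mask, without_mask)
```

In this toy case the unmasked comparison is fooled by the corrupted bands and picks "dinner," whereas ignoring the unreliable bands recovers "cutlet."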

The result is a robust robot listener that is closer to the human brain’s auditory powers than other systems. Okuno says it might handle as many as six talkers depending on their relative angles and the number of microphones utilized (currently eight). The robot can move and orient toward speakers, too, thereby enhancing performance.

“Okuno’s project for robots that can understand overlapping voices does a really nice job of combining the best ideas in multimicrophone source localization with the powerful technique of missing-feature speech recognition,” remarks Dan Ellis, head of Columbia University’s Laboratory for the Recognition and Organization of Speech and Audio. “What makes his work stand out from anything else is the commitment to solving all the practical problems that come up in a real-world deployment ... and making something that ... can enable a robot to understand its human interlocutors in real-world situations.”

Besides serving up fast food, Okuno’s robot could lead to a hearing prosthesis that is just as good at reducing noise interference. Such a device could be combined with a sophisticated automatic paraphrasing system, which Okuno thinks would be even more important because hearing-impaired people rely heavily on context in conversation. Okuno himself is nearly deaf without hearing aids after years of listening to loud music through headphones. “The current hearing capabilities of humanoid robots are similar to mine,” he chuckles.
