
Image: Illustration by Bryan Christie
In Brief
- Computers cannot yet solve the “cocktail party problem”—understanding speech when two or more people are talking at the same time.
- A number of groups are making good progress, though, using various methods.
- A multimedia feature, which is available at www.ScientificAmerican.com/apr2011/speech, describes the logic behind one leading approach in detail and allows you to test your own ability to separate overlapping streams of chatter.
You are at a party, and Alex is telling a boring story. You are much more interested in the gossip that Sam is recounting to Pat, so you tune out Alex and focus on Sam’s words. Congratulations: you have just demonstrated the human ability to solve the “cocktail party problem”—to pick out one thread of speech from the babble of two or more people. Computers so far lack that power.
Although automated speech recognition is increasingly routine, it fails when faced with two people talking at once. Computerized speech separation would not only improve speech-recognition systems, it could also advance many other endeavors that require the separating of signals, such as making sense of brain-scan images.
This article was originally published with the title Solving the Cocktail Party Problem.
2 Comments
To solve the cocktail party problem, computers would first need to be able to analyze and recognize the peculiar wavelength character of one conversation. Then they should be able to pick it out from the midst of two or more conversations. Seems doable.
Humans achieve the cocktail party effect with substantial assistance from binaural hearing -- that is, the brain is adept at focusing attention on sound sources from specific and consistent spatial positions, which are determined largely by the 0.3 to 0.7 ms time delay between the two ears. Algorithms that emulate this approach, noting the time differential between two mics, should be able to fix the sound source in space, and then filter sounds from other locations.
An opposite effect is achieved in some movie theatres, where the speakers are wired with improper phase alignment. The resulting sound presents pseudo time-delay information to the nervous system's binaural decoding faculty, rendering speech relatively unintelligible, presumably because it seems to be coming randomly from multiple sources and the attention can't get a fix on it.
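The two-microphone idea raised in the comments above, noting the time differential between two mics to place a source in space, can be illustrated with a small sketch. This is not the method of any group mentioned in the article; it is a brute-force cross-correlation written only to show the principle. The function names, the 16 kHz sample rate, the 0.2 m microphone spacing, the 343 m/s speed of sound, and the synthetic noise signal are all assumptions made for the example.

```python
import math
import random


def estimate_delay_samples(left, right, max_lag):
    """Return the lag, in samples, at which `right` best lines up with `left`.

    A positive result means the sound reached the right microphone later.
    Hypothetical helper: brute-force cross-correlation over
    lags in [-max_lag, max_lag].
    """
    n = min(len(left), len(right))
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += left[i] * right[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def delay_to_angle_degrees(delay_samples, sample_rate_hz, mic_spacing_m,
                           speed_of_sound=343.0):
    """Convert a delay to an arrival angle, assuming a distant source."""
    delay_s = delay_samples / sample_rate_hz
    # sin(angle) = speed_of_sound * delay / spacing; clamp to stay in [-1, 1].
    s = max(-1.0, min(1.0, speed_of_sound * delay_s / mic_spacing_m))
    return math.degrees(math.asin(s))


# Synthetic test signal (assumption): seeded noise standing in for speech.
rng = random.Random(0)
left = [rng.uniform(-1.0, 1.0) for _ in range(400)]

# The right microphone hears the same signal 5 samples later.
true_delay = 5
right = [0.0] * true_delay + left[:-true_delay]

lag = estimate_delay_samples(left, right, max_lag=20)
angle = delay_to_angle_degrees(lag, sample_rate_hz=16000, mic_spacing_m=0.2)
print("estimated delay (samples):", lag)
print("estimated delay (ms):", 1000.0 * lag / 16000)
print("estimated angle (degrees):", round(angle, 1))
```

At 16 kHz, a 5-sample delay is about 0.31 ms, which falls within the range of interaural delays the commenter cites. A real system would still face the harder step the comment describes, filtering out sound from other locations, which this sketch does not attempt.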