
Image: John R. Hershey, Peder A. Olsen, Steven J. Rennie and Andy Aaron
The year is 1974, and Harry Caul is monitoring a couple walking through a crowded Union Square in San Francisco. He uses shotgun microphones to secretly record their conversation, but at a critical point, a nearby percussion band drowns out the conversation. Ultimately Harry has to use an improbable gadget to extract the nearly inaudible words, “He’d kill us if he got the chance,” from the recordings.
This piece of audio forensics was science fiction when it appeared in the movie The Conversation more than three decades ago. Is it possible today?
Sorting out the babble from multiple conversations is popularly known as the “cocktail party problem,” and researchers have made many inroads toward solving it in the past 10 years. Human listeners can selectively tune out all but the speaker of interest when multiple speakers are talking. Unlike people, machines have been notoriously unreliable at recognizing speech in the presence of noise, especially when the noise is background speech. Speech recognition technology is becoming increasingly ubiquitous and is now being used for dictating text and commands to computers, phones and GPS devices. But good luck getting anything but gibberish if two people speak at once.
A flurry of recent research has focused on the cocktail party problem. In 2006, Martin Cooke of the University of Sheffield in England and Te-Won Lee of the University of California, San Diego, organized a speech separation “challenge,” a task designed to compare different approaches to separating and recognizing the mixed speech of two talkers. Since then, researchers around the world have built systems to compete against one another and against the ultimate benchmark: human listeners.
Here, we survey the computational challenges of speech separation and outline the techniques used to tackle the problem. In particular we describe the workings of the “superhuman” algorithm that three of us (along with our colleague Trausti T. Kristjansson of Google) entered in the separation challenge. We then describe a subsequent algorithm, which can efficiently solve more complicated separation problems with more than two speakers that would take eons to unravel with the original approach. (See also "Solving the Cocktail Party Problem" from the April 2011 issue of Scientific American.)
1. Try it Yourself
To get an idea of what the speech separation algorithms are up against, see if you can hear the target words in some overlapping speech of the kind used in the challenge. All the sentences spoken in the samples use a very limited vocabulary and have the same simple structure as this example: “Place red at C two now.” (The sentences may seem less strange if you imagine they are instructions about what to do with colored tokens in a board game.)
In each mixture, one of the talkers mentions “white.” Your goal is to discern the letter-number combination (“C two” in the example) spoken in the sentence about “white.”
Audio samples:
- Mix 1 (different gender): mixture, target speaker 1, masking speaker 1
- Mix 2 (same gender): mixture, target speaker 2, masking speaker 2
- Mix 3 (same person): mixture, target speaker 3, masking speaker 3
The limited vocabulary and simplified grammar allow researchers to focus on the challenge of separating overlapping speech without requiring the infrastructure needed for recognizing more complicated utterances. The algorithms processed a few thousand such test samples, which varied in several ways. In some samples, the “target” and “masking” talkers were equally loud, but mostly they differed slightly or moderately in volume. The target and masker could have different genders or the same gender, or they could even be the same person speaking both sentences. Human listeners have the greatest trouble when the target and masker are the same person and the target sentence is spoken at about the same or slightly lower volume than the masking sentence.
2. How spectrograms represent speech

To separate the speech of multiple talkers or to recognize one person’s speech, computers represent the sound signal by its spectrum—the energy in the sound at each frequency. A spectrogram displays how the spectrum varies over time, with the color at each point indicating the sound energy at that frequency and time. The spectrogram conveys all the information needed to recognize speech. In fact, computer scientist Victor Zue of the Massachusetts Institute of Technology used to teach a class on how to transcribe speech by just looking at a spectrogram.
To produce a spectrogram, software divides the sound signal into short, overlapping time segments called frames, each about 40 milliseconds long (1/25th of a second). The overlap avoids losing information at the start and end of each frame. The sound spectrum is determined for each frame. Thus a spectrogram is a series of individual spectra, one for each frame. Speech recognition and speech separation typically work by moving along a spectrogram one frame at a time.
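As a rough illustration of the framing step described above, the following Python sketch (not the authors' code) slices a signal into 40-millisecond frames and computes the energy at each frequency for each frame. The 50 percent overlap, the Hann window, the 16 kHz sample rate and the function name `spectrogram` are assumptions made for the example; the article specifies only that frames are about 40 milliseconds long and overlap.

```python
import numpy as np


def spectrogram(signal, sample_rate, frame_ms=40.0, overlap=0.5):
    """Return an array of shape (frames, frequency bins) holding the
    energy of each frame at each frequency."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    # Assumed: each frame starts half a frame after the previous one.
    hop = max(1, int(round(frame_len * (1.0 - overlap))))
    # Assumed: a Hann window tapers the edges of each frame.
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop

    spectra = []
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len] * window
        # Energy at each frequency: squared magnitude of the Fourier transform.
        spectra.append(np.abs(np.fft.rfft(frame)) ** 2)
    return np.array(spectra)


# One second of a pure 1,000 Hz tone as a stand-in for recorded speech.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000.0 * t)

spec = spectrogram(tone, sample_rate)

# Center frequency of every bin, then the loudest frequency in each frame.
frame_len = int(round(sample_rate * 40.0 / 1000.0))
freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
peak_hz = freqs[spec.argmax(axis=1)]
```

Each row of `spec` is the spectrum of one frame. For the test tone, every entry of `peak_hz` sits at the tone's frequency; real speech would instead show energy spread across many frequencies and changing from frame to frame.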
3. Spectrogram of overlapping speech

Mixing audio sources together is a little like pouring milk into coffee. Once they blend together, there is no simple way of separating them. In each time frame, the spectra of the individual sources essentially add together. In principle, dividing the sound into two parts is as arbitrary as asking, “If x plus y equals 10, what are x and y?”
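The "x plus y equals 10" point can be made concrete with a small Python sketch. It uses random noise as a stand-in for one frame from each talker (an assumption for illustration; all variable names here are hypothetical) and shows two things: the Fourier spectrum of the mixture is the sum of the two sources' spectra, and a completely different pair of signals can produce exactly the same mixture.

```python
import numpy as np

rng = np.random.default_rng(0)
frame_len = 640  # one 40 ms frame at an assumed 16 kHz sample rate

# Stand-ins for one frame of each talker (random noise, not real speech).
target = rng.standard_normal(frame_len)
masker = rng.standard_normal(frame_len)
mixture = target + masker

# The Fourier transform is linear, so the complex spectra add exactly.
# (The energies add only approximately, because of a cross term
# between the two sources.)
spec_mix = np.fft.rfft(mixture)
spec_sum = np.fft.rfft(target) + np.fft.rfft(masker)
spectra_add = np.allclose(spec_mix, spec_sum)

# The split is not unique: move any signal from one source to the
# other and the mixture is unchanged.
shift = rng.standard_normal(frame_len)
alt_target = target + shift
alt_masker = masker - shift
same_mixture = np.allclose(alt_target + alt_masker, mixture)

print(spectra_add, same_mixture)
```

Without further assumptions about what speech looks like, nothing in the mixture alone favors the original pair over the alternative one.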
At a real cocktail party, you get some extra information by having two ears. The slightly different sound detected by each ear tells you something about the directions the sounds are coming from, which can help you pick out one talker from the crowd. But you get no such assistance if the two people are in the same general direction, and neither does a computer processing a recording made with a single microphone. The speech separation challenge focused on this “monaural” version of the problem.
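To give a feel for the directional cue that two ears provide, and that a single-microphone recording lacks, here is a minimal Python sketch that estimates how much later a sound arrives at one "ear" than the other by trying different time shifts and keeping the one where the two signals line up best. This is a textbook cross-correlation illustration, not part of the challenge systems; the function name `estimate_delay`, the 16 kHz sample rate, the 8-sample delay and the noise source are all assumptions.

```python
import numpy as np


def estimate_delay(left, right, max_lag):
    """Return the lag, in samples, by which `right` trails `left`.
    A negative result means `right` leads `left`."""
    n = len(left)
    best_lag, best_score = 0, -np.inf
    for lag in range(-max_lag, max_lag + 1):
        # Align the two signals for this candidate lag.
        if lag >= 0:
            a, b = left[: n - lag], right[lag:]
        else:
            a, b = left[-lag:], right[: n + lag]
        # A large dot product means the aligned signals match well.
        score = float(np.dot(a, b))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


rng = np.random.default_rng(1)
sample_rate = 16000  # assumed
source = rng.standard_normal(2000)

# The sound reaches the right "ear" 8 samples (0.5 ms) after the left.
true_delay = 8
left = source
right = np.concatenate([np.zeros(true_delay), source[:-true_delay]])

delay = estimate_delay(left, right, max_lag=20)
delay_ms = 1000.0 * delay / sample_rate
```

With one microphone there is only one signal, so there is no delay to measure, which is why the monaural problem has to rely on the structure of speech itself.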
Fortunately, as is apparent by looking at spectrograms, the sounds of speech have a lot of structure. All approaches to speech separation exploit this structure to some degree.






8 Comments
One thing the article didn't talk about (or I didn't catch it) was that we humans hear in stereo. Or at least those of us that have 2 good ears do. I don't, so I have a very hard time distinguishing between different voices, the same as computers seem to. Humans with stereo sound capability can lock in on a person's voice in 3D space and filter out anything outside of that space. A good way to test that is to have someone listen to people talking in a group and compare what they heard with a mono recording of that same group from the same location. I think that a person would have the same problem discriminating between different people talking at the same time from the recording as the computer does.
At least that has been my experience.
Joenn
Joenn, you're right, my short summary (Solving the Cocktail Party Problem) didn't address this issue. In this longer piece (Audio Alchemy) the researchers spell out that they're looking at the "monaural" case where there's only one microphone and thus no directional information about the talkers.
There is a new separation "challenge" on at the moment that focuses on the task of extracting speech from recordings made "in a noisy living room" using two microphones: http://www.dcs.shef.ac.uk/spandh/chime/challenge
Results are going to be discussed at a conference in September.
As a native Japanese/English bilingual speaker, I've always thought it was easier to follow multiple overlapping conversations when they were spoken in different languages. I don't know whether this is because the spectral properties of the languages are different, or because I'm using different parts of my brain to understand the different languages simultaneously, but I would be interested to find out.
That's an interesting comment. In our work we have found that differences such as gender, pitch and volume made it easier for our algorithm to separate the audio. Probably this is not a property of the algorithm, but rather of how the problem becomes easier the more different the signals are. If two twins are talking at the same time and volume and pause at the same time the computer can't possibly know who continues what -- short of using some language information.
Excellent article...very much enjoyed...
I am extremely hard of hearing. I use hearing aids and must give an external FM transmitter to a close companion in order to have a conversation with that companion in a crowded dining room. My wife on the other hand has excellent hearing and discrimination. I am always astounded at her ability to discriminate and "listen in" to someone speaking several tables distant.
I especially have trouble hearing, and maybe even thinking, in a cocktail party situation. I wear two hearing aids. It's always seemed like if they could communicate with each other, they could recognise what sounds are in phase, therefore coming from directly in front of me, and amplify those sounds over others. These days they should be able to build the needed computer systems into the frames of my glasses.
I experienced (total) sudden hearing loss in one ear 2 years ago. I can now no longer "place" sounds: I can't tell where a noise is coming from. And in noisy environments, conversation is almost impossible. In noisy restaurants, even if I'm directly looking at the person speaking, I have great difficulty understanding what is said. My thoughts: with 2 ears, our brain constructs an "aural dome" of our environment, giving us a lot of information about our surroundings. Without these directional cues, it's all just noise -- which is my situation now. Seems to me solving the cocktail party problem could be vastly improved by a 2-microphone (i.e., stereo) recording. And breaking up the noise by separating it into tracks from discrete directions would seem to go a long way toward splitting the "noise" into multiple speech sources.