The year is 1974, and Harry Caul is monitoring a couple walking through a crowded Union Square in San Francisco. He uses shotgun microphones to secretly record their conversation, but at a critical point, a nearby percussion band drowns out the conversation. Ultimately Harry has to use an improbable gadget to extract the nearly inaudible words, “He’d kill us if he got the chance,” from the recordings.
This piece of audio forensics was science fiction when it appeared in the movie The Conversation more than three decades ago. Is it possible today?
Sorting out the babble from multiple conversations is popularly known as the “cocktail party problem,” and researchers have made many inroads toward solving it in the past 10 years. Human listeners can selectively tune out all but the speaker of interest when multiple speakers are talking. Unlike people, machines have been notoriously unreliable at recognizing speech in the presence of noise, especially when the noise is background speech. Speech recognition technology is becoming increasingly ubiquitous and is now being used for dictating text and commands to computers, phones and GPS devices. But good luck getting anything but gibberish if two people speak at once.
A flurry of recent research has focused on the cocktail party problem. In 2006, Martin Cooke of the University of Sheffield in England and Te-Won Lee of the University of California, San Diego, organized a speech separation “challenge,” a task designed to compare different approaches to separating and recognizing the mixed speech of two talkers. Since then, researchers around the world have built systems to compete against one another and against the ultimate benchmark: human listeners.
Here, we survey the computational challenges of speech separation and outline the techniques used to tackle the problem. In particular we describe the workings of the “superhuman” algorithm that three of us (along with our colleague Trausti T. Kristjansson of Google) entered in the separation challenge. We then describe a subsequent algorithm, which can efficiently solve more complicated separation problems with more than two speakers that would take eons to unravel with the original approach. (See also "Solving the Cocktail Party Problem" from the April 2011 issue of Scientific American.)
1. Try it Yourself
To get an idea of what the speech separation algorithms are up against, see if you can hear the target words in some overlapping speech of the kind used in the challenge. All the sentences spoken in the samples use a very limited vocabulary and have the same simple structure as this example: “Place red at C two now.” (The sentences may seem less strange if you imagine they are instructions about what to do with colored tokens in a board game.)
In each mixture, one of the talkers mentions “white.” Your goal is to discern the letter-number combination (“C two” in the example) spoken in the sentence about “white.”
The limited vocabulary and simplified grammar allow researchers to focus on the challenge of separating overlapping speech without requiring the infrastructure needed for recognizing more complicated utterances. The algorithms processed a few thousand such test samples, which varied in several ways. In some samples, the “target” and “masking” talkers were equally loud, but in most they differed slightly or moderately in volume. The target and masker could be different genders or the same gender, or they could even be the same person speaking both sentences. Human listeners have the greatest trouble when target and masker are the same person and the target speaks at about the same volume as the masker, or slightly quieter.
2. How spectrograms represent speech
To separate the speech of multiple talkers or to recognize one person’s speech, computers represent the sound signal by its spectrum—the energy in the sound at each frequency. A spectrogram displays how the spectrum varies over time, with the color at each point indicating the sound energy at that frequency and time. The spectrogram conveys all the information needed to recognize speech. In fact, computer scientist Victor Zue of the Massachusetts Institute of Technology used to teach a class on how to transcribe speech by just looking at a spectrogram.
To produce a spectrogram, software divides the sound signal into short, overlapping time segments called frames, each about 40 milliseconds long (1/25th of a second). The overlap avoids losing information at the start and end of each frame. The sound spectrum is determined for each frame. Thus a spectrogram is a series of individual spectra, one for each frame. Speech recognition and speech separation typically work by moving along a spectrogram one frame at a time.
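The framing-and-transform procedure described above can be sketched in a few lines of numpy. This is a minimal illustration, not the challenge entrants' actual code; the function name, the 8 kHz sample rate, and the 50 percent overlap are assumptions chosen to match the 40-millisecond frames mentioned in the text.

```python
import numpy as np

def spectrogram(signal, sample_rate, frame_ms=40, hop_ms=20):
    """Split a signal into overlapping frames and take the
    magnitude spectrum of each one."""
    frame_len = int(sample_rate * frame_ms / 1000)   # ~40 ms per frame
    hop = int(sample_rate * hop_ms / 1000)           # 50% overlap
    window = np.hanning(frame_len)                   # taper frame edges
    frames = [signal[i:i + frame_len] * window
              for i in range(0, len(signal) - frame_len + 1, hop)]
    # One spectrum per frame; rows = time frames, columns = frequency bins
    return np.abs(np.fft.rfft(frames, axis=1))

# Example: one second of a pure 440 Hz tone sampled at 8 kHz
sr = 8000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 440 * t), sr)
print(spec.shape)  # (number of frames, number of frequency bins)
```

Plotting `spec` with time on one axis and frequency on the other, colored by magnitude, gives exactly the kind of picture described above; for the tone in the example, the energy concentrates in the frequency bin nearest 440 Hz in every frame.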
3. Spectrogram of overlapping speech
Mixing audio sources together is a little like pouring milk into coffee. Once they blend together, there is no simple way of separating them. In each time frame, the spectra of the sources essentially add together. In principle, splitting the mixed sound back into two parts is as underdetermined as asking, “If x plus y equals 10, what are x and y?”
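The milk-and-coffee point can be checked numerically: because the Fourier transform is linear, the spectrum of a mixture in each frame is the sum of its sources' spectra, and many different pairs of sources produce the identical sum. A small numpy sketch (the random signals and frame length are illustrative, not challenge data):

```python
import numpy as np

rng = np.random.default_rng(0)
frame = 320                                # one ~40 ms frame at 8 kHz
a = rng.standard_normal(frame)             # "talker A" in this frame
b = rng.standard_normal(frame)             # "talker B" in this frame

# The spectrum of the mixture equals the sum of the two spectra...
mix_spec = np.fft.rfft(a + b)
sum_spec = np.fft.rfft(a) + np.fft.rfft(b)
print(np.allclose(mix_spec, sum_spec))     # True

# ...so the observation alone cannot say how to split it: shifting any
# signal d from one talker to the other leaves the mixture unchanged.
d = rng.standard_normal(frame)
print(np.allclose(np.fft.rfft((a + d) + (b - d)), mix_spec))  # True
```

This is the "x plus y equals 10" problem in spectral form: without extra constraints, every split is consistent with the recording, which is why the structure of speech discussed below is essential.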
At a real cocktail party, you get some extra information by having two ears. The slightly different sound detected by each ear tells you something about the directions the sounds are coming from, which can help you pick out one talker from the crowd. But you get no such assistance if the two talkers are in the same general direction, and neither does a computer processing a recording made with a single microphone. The speech separation challenge focused on this “monaural” version of the problem.
Fortunately, as is apparent by looking at spectrograms, the sounds of speech have a lot of structure. All approaches to speech separation exploit this structure to some degree.