The year is 1974, and Harry Caul is monitoring a couple walking through a crowded Union Square in San Francisco. He uses shotgun microphones to secretly record their conversation, but at a critical point, a nearby percussion band drowns out the conversation. Ultimately Harry has to use an improbable gadget to extract the nearly inaudible words, “He’d kill us if he got the chance,” from the recordings.
This piece of audio forensics was science fiction when it appeared in the movie The Conversation more than three decades ago. Is it possible today?
Sorting out the babble from multiple conversations is popularly known as the “cocktail party problem,” and researchers have made many inroads toward solving it in the past 10 years. Human listeners can selectively tune out all but the speaker of interest when multiple speakers are talking. Unlike people, machines have been notoriously unreliable at recognizing speech in the presence of noise, especially when the noise is background speech. Speech recognition technology is becoming increasingly ubiquitous and is now being used for dictating text and commands to computers, phones and GPS devices. But good luck getting anything but gibberish if two people speak at once.
A flurry of recent research has focused on the cocktail party problem. In 2006, Martin Cooke of the University of Sheffield in England and Te-Won Lee of the University of California, San Diego, organized a speech separation “challenge,” a task designed to compare different approaches to separating and recognizing the mixed speech of two talkers. Since then, researchers around the world have built systems to compete against one another and against the ultimate benchmark: human listeners.
Here, we survey the computational challenges of speech separation and outline the techniques used to tackle the problem. In particular we describe the workings of the “superhuman” algorithm that three of us (along with our colleague Trausti T. Kristjansson of Google) entered in the separation challenge. We then describe a subsequent algorithm, which can efficiently solve more complicated separation problems with more than two speakers that would take eons to unravel with the original approach. (See also "Solving the Cocktail Party Problem" from the April 2011 issue of Scientific American.)
1. Try it Yourself
To get an idea of what the speech separation algorithms are up against, see if you can hear the target words in some overlapping speech of the kind used in the challenge. All the sentences spoken in the samples use a very limited vocabulary and have the same simple structure as this example: “Place red at C two now.” (The sentences may seem less strange if you imagine they are instructions about what to do with colored tokens in a board game.)
In each mixture, one of the talkers mentions “white.” Your goal is to discern the letter-number combination (“C two” in the example) spoken in the sentence about “white.”
- Click arrows for audio
- Mix 1: (different gender)
- MP3 file
- Target speaker 1:
- MP3 file
- Masking speaker 1:
- MP3 file
- Mix 2: (same gender)
- MP3 file
- Target speaker 2:
- MP3 file
- Masking speaker 2:
- MP3 file
- Mix 3: (same person)
- MP3 file
- Target speaker 3:
- MP3 file
- Masking speaker 3:
- MP3 file
The limited vocabulary and simplified grammar allow the research to focus on the challenge of separating overlapping speech without requiring the infrastructure needed for recognizing more complicated utterances. The algorithms processed a few thousand such test samples, which varied in several ways. In some samples, the “target” and “masking” talker were equally loud, but mostly they differed slightly or moderately in volume. The target and masker could have different genders or the same gender, or they could even be the same person speaking both sentences. Human listeners have the greatest trouble when the target is the same person, speaking at about the same or slightly lower volume than the masker.
2. How spectrograms represent speech
- Click arrow for audio
- MP3 file
To separate the speech of multiple talkers or to recognize one person’s speech, computers represent the sound signal by its spectrum—the energy in the sound at each frequency. A spectrogram displays how the spectrum varies over time, with the color at each point indicating the sound energy at that frequency and time. The spectrogram conveys all the information needed to recognize speech. In fact, computer scientist Victor Zue of the Massachusetts Institute of Technology used to teach a class on how to transcribe speech by just looking at a spectrogram.
To produce a spectrogram, software divides the sound signal into short, overlapping time segments called frames, each about 40 milliseconds long (1/25th of a second). The overlap avoids losing information at the start and end of each frame. The sound spectrum is determined for each frame. Thus a spectrogram is a series of individual spectra, one for each frame. Speech recognition and speech separation typically works by moving along a spectrogram one frame at a time.
3. Spectrogram of overlapping speech
Mixing audio sources together is a little like pouring milk into coffee. Once they blend together, there is no simple way of separating them. In each time frame, the spectrum from each source essentially adds together. In principle, dividing the sound into two parts is as arbitrary as asking, “If x plus y equals 10, what are x and y?”
At a real cocktail party, you get some extra information by having two ears. The slightly different sound detected by each ear tells you something about the directions the sounds are coming from, which can help you in picking out one talker from the crowd. But you get no such assistance if the two people are in the same general direction and neither does a computer processing a recording made with a single microphone. The speech separation challenge focused on this “monaural” version of the problem.
Fortunately, as is apparent by looking at spectrograms, the sounds of speech have a lot of structure. All approaches to speech separation exploit this structure to some degree.
4. Beads on a string model of speech
Computers encode the structure of speech in models, which typically represent thousands of words, each word built from a multitude of possible sounds. The most commonly used speech model is the so called beads-on-a-string model.
The pronunciation of a word is represented as a series of basic sound units called phonemes. For instance, “white” can be represented by the phoneme sequence “W-AY-T” (“AY” is used to represent the sound of the letter “i” in “white”). Each phoneme is further divided into a sequence of states—the “beads”—which represent how the sound power spectrum changes over the duration of a phoneme.
The duration of each phoneme state will vary depending on the speaker and the manner of speech. Speech models encode this variability in duration using probabilities: If the person is saying “white” and the sound in one time frame is phoneme state AY-1, the model specifies the probability of the sound switching to state AY-2 in the next frame.
Probabilities also enter the picture because each phoneme state can correspond to many different possible spectra in a frame of the spectrogram, again depending on all the different ways that someone could produce the sound—tonal inflections, the stress on the syllable, the overall pitch and timbre of the voice and so on.
5. Hidden Markov model
The full speech recognition model also incorporates the probabilities of different sequences of words. In the test speech samples, these probabilities are very simple—after the word “place,” for instance, the words “red,” “green,” “white” and “blue” each have a probability of 25 percent, and all other words have zero probability. Things get much more complicated in real-world speech recognition.
This structure forms what is called the language model, or the task grammar:
When the language model is combined with the beads-on-a-string model of phonemes, the result is a model of how the phoneme states may change in the course of a sentence, including silences that may or may not occur between words:
The full structure—including the probabilities for the sound spectra expected to appear in a spectrogram frame when the person is speaking a specific phoneme state—is known as a hidden Markov model, named after the 19th century Russian mathematician Andrey Andreyevich Markov.
A “Markov process” or “Markov chain” is a sequence of random states in which the probability of what comes at the next time step depends only on the current state and not on anything earlier. Thus the model only cares that the current state is phoneme state AY-1 in the word “white.” It doesn’t care if the previous state was W-3 or AY-1.
The Markov model is “hidden” because all that the computer gets to see directly are the sound spectra. The computer cannot see the phoneme state AY-1 directly—it can only estimate the probability of state AY-1 based on the observed sound spectra.
Researchers at IBM pioneered the use of hidden Markov models for speech recognition in the early 1970s. The recognition algorithm searches for the sequence of states in the speech model that best explains the observed sounds. Carrying out that search efficiently involves a process called belief propagation.
6. A dicey game: Belief propagation
A dice game illustrates how belief propagation works.
The game involves two dice. One is a fair die that is equally likely to land on each number from 1 to 6. The other die is loaded so that it always lands on the same number, but we do not know what that number is. The dice are rolled repeatedly, and we are told only their sum each time. Our task is to determine the numbers that the dice showed.
Before the first roll, we know nothing about which number the loaded die always shows. We believe each number to be equally likely. We can represent our beliefs for the two dice with a grid colored to show the uniform probabilities for each combination of numbers:
The first roll has a sum of 9. We know there are four ways to get that total with the two dice: 3+6, 4+5, 5+4, or 6+3. We now know the loaded die could not be a 1 or 2.
We update our belief about the loaded die: 1 and 2 have zero probability, and 3, 4, 5 and 6 each have 25 percent probability. We can propagate this belief forward to the next roll.
The second roll has a sum of 8. We learn nothing new about the loaded die because all the possibilities (3, 4, 5, 6) can combine with a fair die to make 8. We propagate our belief about the loaded die to the next roll.
The third roll has a sum of 5. We now know the loaded die must be either 3 or 4 because 5 and 6 are too big.
Again we update our beliefs accordingly and propagate them forward to the next dice roll.
The fourth roll has a sum of 10. We now know the loaded die must be 4 because the only other remaining possibility, 3, is too small.
As well as updating our beliefs to reflect this certainty about the loaded die, we can now propagate our beliefs backward and infer the number on the fair die in each of the rolls.
As well as illustrating the general idea of belief propagation, this example has also shown how the technique can separate two “signals” (the numbers on the dice) given some knowledge about how the signals behave over time (in this case, that the loaded die never changes).
7. The Viterbi algorithm—belief propagation for single talker speech recognition
Belief propagation for recognition of a single talker’s speech follows a similar general process, just with much more complicated probabilities. Each successive frame of a spectrogram is a new roll of the die. We carry forward beliefs built up from the earlier rolls. A procedure called the Viterbi algorithm carries out this search for the best explanation of the observed sound.
For instance, the sound in the frames so far may have been consistent with either R-EH or W-AY, with the former being more likely. We could represent the possible sequences of phoneme states and their probabilities in a diagram, with looping arrows to indicate states that have repeated across several frames of the spectrogram:
(We are simplifying things by not dividing the phonemes into states, such as R-1, R-2, R-3.) Suppose that the spectrum in the next frame has high probability if the phoneme is T, a moderate probability for phoneme B and a very low probability for phoneme D.
Our speech model has no words with the sounds W-AY-B or R-EH-B, but it does have “white” and “red.” Our updated beliefs give W-AY-T a high probability and R-EH-D a lower probability.
Suppose our language model also included W-AY-D (for instance, in the word “wide”), and suppose that we estimated that the speech W-AY-D was even less probable than R-EH-D given the observed audio in the frames so far. The Viterbi algorithm, however, only keeps the best sequence for each possible current phoneme state—in this case R-EH-D for D and W-AY-T for T. This trimming of possibilities makes the Viterbi algorithm an efficient way to search for the most likely speech.
Thus, we safely discard W-AY-D from further consideration and carry forward our beliefs about R-EH-D and W-AY-T to the next spectrogram frame.
8. Models for overlapping speech
When two words are spoken at once, the phonemes in each word may align in many possible ways. A speech separation algorithm has to find the most likely pair of sound signals to explain the full signal.
For instance, consider the case of two people saying the words “three” and “six.” The phonemes may happen to align in this fashion:
A brute force approach to analyze this overlapping speech is to combine two single-talker speech models to form a two-speaker model. Each state in this model consists of a pair of phoneme states. The overlapping words would be represented like this:
The arrows indicate the possible transitions among the two-phoneme states. Each transition has an associated probability. The heavier blue arrows show the actual sequence of states in the example.
The probabilities for sequences of these two-speaker states can be computed from the probabilities in the single speaker models and from an “interaction function” that describes how the individual audio signals combine. The Viterbi algorithm can search for the most probable sequence of two-speaker states to explain the total audio signal.
This approach is feasible for combining two small speech models, such as with the test samples of two people using a very limited vocabulary. Our algorithm used this approach in the speech separation challenge and performed better than human listeners—the first speech separation algorithm to do so. Such superhuman performance by a computer is an extremely rare accomplishment in any field of audio or visual perception.
Yet a critically important challenge remained: to find an algorithm that would scale up and remain efficient when dealing with larger speech models and more talkers.
9. The combinatorial problem
Trying to analyze overlapping speech by using a brute force combination of speaker models becomes exponentially more time consuming as the number of talkers increases. The graph shows the time it would take a modern PC to analyze a single second of sound produced by a number of people speaking at once. Here we assume each second of audio is divided into 50 spectrogram frames, and an individual’s speech in any given spectrogram frame consists of one of 10,000 possible sounds. The n-talker model must therefore handle 10,000n possible sound combinations in each frame. We also assume the PC can analyze a million sound combinations per second.
Clearly this brute force approach is hopeless.
10. The bottom-up approach
An alternative to modeling multi-talker speech completely is to analyze the combined sound from the bottom up to find regions of the spectrogram that are produced by the same speaker. This approach, known as computational auditory scene analysis (CASA), uses features of the spectrogram to identify what sounds should be grouped together. For instance, two sounds that start at the same instant in time probably come from a single person, even if they involve several different frequency bands. Similarly, harmonics with the same fundamental frequency (the same “pitch”) are most likely from the same speaker.
Such methods are often combined with missing feature approaches, which use a single-talker model to recognize the speech based on regions of the spectrogram that are dominated by the target talker, while ignoring areas dominated by other sounds. Missing feature approaches can also be used to help piece together the regions identified by bottom-up methods.
Using single-talker recognition in this way avoids the problem of searching all combinations of speech sounds. The sparseness of speech makes this kind of approach practical: even when two or three people are talking at once, in many small regions of a spectrogram the sound is dominated by just one of the talkers.
An advantage of bottom-up and missing feature approaches is that they make fewer assumptions about the sounds that may be masking the target speaker and so they may generalize better to different kinds of noisy environments.
The main limitation of bottom-up approaches is that the regions dominated by one speaker are rarely simple, because overlapping speech signals are generally highly intertwined over frequency and time. What is lost is the ability to use knowledge about the speech patterns of both the target and the background speakers to help disentangle sounds that have been interleaved in this way.
11. Loopy belief propagation
Our algorithm, like those of many groups, uses the top-down approach, which tries to model the full sound from multiple talkers without first picking out special regions in the spectrogram. A variation of belief propagation called loopy belief propagation plays an important role in speeding up the search through all the possibilities. A method for solving sudoku puzzles illustrates how loopy belief propagation works.
To keep things simple, consider a 4×4 sudoku instead of the usual 9×9 version.
The goal of sudoku is to find numbers for all the empty squares so that each row, column and 2×2 sub-box contains all the numbers from 1 to 4. Systematically trying out every possible combination would be very slow, similar to doing the full Viterbi algorithm for multiple talkers. (We are thinking of a person solving the puzzle with pencil and paper. A computer search through all possibilities is actually not excessive even with 9×9 puzzles.)
Instead the requirements for rows, columns and sub-boxes can be considered one at a time, updating the beliefs about the possible numbers in each square at each step. We start with all the possibilities written in the unsolved squares.
To enforce the row constraints, we cross out those numbers that match the known numbers in each row. One way to think of this step is that the squares with known numbers send out messages—“you cannot be my number”—to all the other squares in the same row.
Next we do the same for the columns.
At this point we have solved some of the squares, and we update the grid accordingly.
Then we proceed to send out box messages, eliminating numbers that match known numbers in each of the 2×2 sub-boxes.
Again we update our beliefs about solved numbers.
The problem is not completely solved after imposing all these requirements once, so we go back and consider them in sequence again in the light of our updated beliefs. Sending out row messages again eliminates new numbers because of information gained in the column and sub-box steps.
This simple example is now solved.
The general method is called loopy belief propagation because the requirements (rows, columns, sub-boxes) are intertwined.
Applying the constraints iteratively greatly simplified the search for the solution. The row constraints, for instance, were independent of one another, so the row messages from each known number could be handled independently.
The algorithm described here cannot solve every sudoku puzzle, but just slightly more complicated iterative methods do succeed for all of them. Applied to more complicated problems, however, iterative approaches often find only an approximation to the best solution.
12. Loopy belief propagation for superhuman two-speaker separation
This iterative approach can be applied to speech separation by alternating between the different talkers. Consider the case of two people, talker A and talker B. We form an initial estimate of the likely speech of each talker. This estimate may be far from the best solution because we do not try to consider every possibility (that would take too long). Now we hold our estimate of talker B’s speech fixed and use the Viterbi algorithm to find the best estimate of talker A. Then we hold that revised estimate for talker A fixed and search for a better estimate of talker B’s speech. The process continues, alternating between talker A and talker B until our estimates stop changing. For three or more speakers, we rotate through them sequentially.
Even if this search takes many iterations, the process is much faster than evaluating both speakers at once with a multi-talker model. For example, if there are six speakers, each with 10,000 sounds, each iteration must evaluate 60,000 sound combinations—a lot less than the trillion trillion combinations in the full six-talker model!
A common metaphor for these kinds of searches is a landscape. The correct solution to the problem is at the bottom of the lowest valley. An exhaustive search would examine the height of every point on the landscape, but that would take too long. The iterative approach involves looking along a line running north-south through the current best choice, and moving to the lowest point along that line. Then repeat with the line running east-west through the new point.
Iterating between the speakers in this fashion is fast and quite effective. In the speech separation challenge this approach outperformed most other algorithms, but it still fell short of human performance.
A potential pitfall comes up if the search gets stuck in a valley that is deeper than anything lying due north, south, east or west, but the deepest valley (the correct answer) is out of reach on the other side of a big ridge. Which valley the search ends in may vary depending on where the search happened to start out.
This problem can be alleviated by keeping track of multiple locations on the landscape, with an estimated probability for each. Loopy belief propagation updates and propagates these probabilities through each iteration.
We developed an algorithm that uses loopy belief propagation in this fashion to decode what each person is saying iteratively, without having to search through all the possible combinations of everyone’s words. The technique only speeds up a part of the speech separation problem, however, because it relies on having estimates of the most likely words the speakers are saying. Obtaining those estimates still requires searching through a vast number of possible sound combinations that the talkers could have produced. The loopy belief propagation makes the speech separation faster than the brute force multi-speaker search and far more accurate than simple iteration between speakers, but the method still would consume ridiculous amounts of computing time for three or more speakers.
A final innovation is required to control this combinatorial explosion.
13. The masking function to the rescue
The way that the spectra of two speech signals combine to form the spectrum of the total sound is complicated. Because of the structure of speech, however, the louder of two sounds at a given frequency typically has significantly more power than the quieter sound. The total sound at each frequency is roughly equal to the louder sound because the quieter sounds usually contribute too little to be significant.
The figure shows this effect in the spectrum of sound from two overlapping speakers for a single 40-millisecond time frame (one time slice from a spectrogram). Talker A, a female, is louder over most of the higher frequencies whereas talker B, a male, is louder over most of the lower frequencies. In most frequency regions, the spectrum of the dominant speaker is very close to the spectrum of the mixture.
We can exploit this feature by using a so-called masking function. This function indicates which speaker dominates at each frequency band in a frame. The masking function for the whole spectrogram divides the data into regions in which talker A dominates and regions in which talker B dominates. If we know the masking function, we can evaluate both talkers’ speech independently: each talker’s speech must match the total sound where that talker dominates and it must be quieter than the total sound elsewhere. Finding the best estimate of A’s speech no longer depends on our estimate of B’s speech except for the information carried by the masking function.
Recently we discovered how to separate speech efficiently using these masking functions. We do so by alternating between two steps. In one step we use our current beliefs about the talkers’ speech to estimate beliefs (probabilities) about the masking functions. In the other, we use the masking functions to update our beliefs about the speech of all the speakers.
The ability to carry along many possible masking functions through the iterations greatly reduces the chances of getting stuck in the wrong valley of the landscape.
14. Multiple talkers
The new algorithm based on speaker masking makes it possible to separate more complicated speech mixtures that were previously beyond the reach of computerized analysis. For example, the algorithm has successfully extracted the speech of one talker from the babble of three other talkers:
- Click arrow for audio
- MP3 file
The mixed speech consists of people saying these sentences:
Lay white at K 5 again.
Bin blue by M zero now.
Set green in M 7 please.
Lay green with S 7 please
Our algorithm’s reconstruction of the third of these speakers sounds is very clear:
- Click arrow for audio
- MP3 file
The task of discerning that talker is made all the harder because some of the other people are speaking at a louder volume. Despite that challenge, our algorithm reconstructs the target speech accurately enough to recognize the words correctly.
These spectrograms illustrate the mixed speech, the actual target speech and our reconstruction of it:
The color scale goes from low power (blue) to high power (red).
The masking function that makes this speech separation possible can also be depicted as a kind of spectrogram. In these images, a white area indicates that the target speaker dominates the audio signal in that frequency band at that time and black areas indicate where sound from the other speakers dominates.
By the end of the iteration process, our algorithm’s estimate of the masking function based on its analysis of the mixed audio is quite close to the actual masking function computed using the true audio signals that made up the mixture.
This approach to speech separation has applications far more important than audio surveillance or improvement of speech recognition in general. The basic problem of separating signals is fundamental, appearing in a broad spectrum of applications ranging from brain imaging analysis to development of intelligent robots. It will be some time before a mechanical bartender takes your cocktail order over the chatter of a festive party, but the era of robust robotic audio interaction has now begun.