10. The bottom-up approach
An alternative to modeling multi-talker speech completely is to analyze the combined sound from the bottom up to find regions of the spectrogram that are produced by the same speaker. This approach, known as computational auditory scene analysis (CASA), uses features of the spectrogram to identify what sounds should be grouped together. For instance, two sounds that start at the same instant in time probably come from a single person, even if they involve several different frequency bands. Similarly, harmonics with the same fundamental frequency (the same “pitch”) are most likely from the same speaker.
Such methods are often combined with missing feature approaches, which use a single-talker model to recognize the speech based on regions of the spectrogram that are dominated by the target talker, while ignoring areas dominated by other sounds. Missing feature approaches can also be used to help piece together the regions identified by bottom-up methods.
Using single-talker recognition in this way avoids the problem of searching all combinations of speech sounds. The sparseness of speech makes this kind of approach practical: even when two or three people are talking at once, in many small regions of a spectrogram the sound is dominated by just one of the talkers.
An advantage of bottom-up and missing feature approaches is that they make fewer assumptions about the sounds that may be masking the target speaker and so they may generalize better to different kinds of noisy environments.
The main limitation of bottom-up approaches is that the regions dominated by one speaker are rarely simple, because overlapping speech signals are generally highly intertwined over frequency and time. What is lost is the ability to use knowledge about the speech patterns of both the target and the background speakers to help disentangle sounds that have been interleaved in this way.
11. Loopy belief propagation
Our algorithm, like those of many groups, uses the top-down approach, which tries to model the full sound from multiple talkers without first picking out special regions in the spectrogram. A variation of belief propagation called loopy belief propagation plays an important role in speeding up the search through all the possibilities. A method for solving sudoku puzzles illustrates how loopy belief propagation works.
To keep things simple, consider a 4×4 sudoku instead of the usual 9×9 version.

The goal of sudoku is to find numbers for all the empty squares so that each row, column and 2×2 sub-box contains all the numbers from 1 to 4. Systematically trying out every possible combination would be very slow, similar to doing the full Viterbi algorithm for multiple talkers. (We are thinking of a person solving the puzzle with pencil and paper. A computer search through all possibilities is actually not excessive even with 9×9 puzzles.)
Instead the requirements for rows, columns and sub-boxes can be considered one at a time, updating the beliefs about the possible numbers in each square at each step. We start with all the possibilities written in the unsolved squares.

To enforce the row constraints, we cross out those numbers that match the known numbers in each row. One way to think of this step is that the squares with known numbers send out messages—“you cannot be my number”—to all the other squares in the same row.

Next we do the same for the columns.

At this point we have solved some of the squares, and we update the grid accordingly.

Then we proceed to send out box messages, eliminating numbers that match known numbers in each of the 2×2 sub-boxes.

Again we update our beliefs about solved numbers.

The problem is not completely solved after imposing all these requirements once, so we go back and consider them in sequence again in the light of our updated beliefs. Sending out row messages again eliminates new numbers because of information gained in the column and sub-box steps.

This simple example is now solved.

The general method is called loopy belief propagation because the requirements (rows, columns, sub-boxes) are intertwined.
Applying the constraints iteratively greatly simplified the search for the solution. The row constraints, for instance, were independent of one another, so the row messages from each known number could be handled independently.
The algorithm described here cannot solve every sudoku puzzle, but just slightly more complicated iterative methods do succeed for all of them. Applied to more complicated problems, however, iterative approaches often find only an approximation to the best solution.
12. Loopy belief propagation for superhuman two-speaker separation
This iterative approach can be applied to speech separation by alternating between the different talkers. Consider the case of two people, talker A and talker B. We form an initial estimate of the likely speech of each talker. This estimate may be far from the best solution because we do not try to consider every possibility (that would take too long). Now we hold our estimate of talker B’s speech fixed and use the Viterbi algorithm to find the best estimate of talker A. Then we hold that revised estimate for talker A fixed and search for a better estimate of talker B’s speech. The process continues, alternating between talker A and talker B until our estimates stop changing. For three or more speakers, we rotate through them sequentially.
Even if this search takes many iterations, the process is much faster than evaluating both speakers at once with a multi-talker model. For example, if there are six speakers, each with 10,000 sounds, each iteration must evaluate 60,000 sound combinations—a lot less than the trillion trillion combinations in the full six-talker model!
A common metaphor for these kinds of searches is a landscape. The correct solution to the problem is at the bottom of the lowest valley. An exhaustive search would examine the height of every point on the landscape, but that would take too long. The iterative approach involves looking along a line running north-south through the current best choice, and moving to the lowest point along that line. Then repeat with the line running east-west through the new point.


Iterating between the speakers in this fashion is fast and quite effective. In the speech separation challenge this approach outperformed most other algorithms, but it still fell short of human performance.
A potential pitfall comes up if the search gets stuck in a valley that is deeper than anything lying due north, south, east or west, but the deepest valley (the correct answer) is out of reach on the other side of a big ridge. Which valley the search ends in may vary depending on where the search happened to start out.

This problem can be alleviated by keeping track of multiple locations on the landscape, with an estimated probability for each. Loopy belief propagation updates and propagates these probabilities through each iteration.
We developed an algorithm that uses loopy belief propagation in this fashion to decode what each person is saying iteratively, without having to search through all the possible combinations of everyone’s words. The technique only speeds up a part of the speech separation problem, however, because it relies on having estimates of the most likely words the speakers are saying. Obtaining those estimates still requires searching through a vast number of possible sound combinations that the talkers could have produced. The loopy belief propagation makes the speech separation faster than the brute force multi-speaker search and far more accurate than simple iteration between speakers, but the method still would consume ridiculous amounts of computing time for three or more speakers.
A final innovation is required to control this combinatorial explosion.



See what we're tweeting about


8 Comments
Add CommentOne thing the article didn't talk about (or I didn't catch it) was that we humans hear in sterio. Or at least those of us that have 2 good ears do. I don't so I have a very hard time distinguishing between different voices the same as computers seem to. Humans with sterio sound capability can lock in on a persons voice in 3D space and filter out anything outside of that space. A good way to test that is to have someone listen to people talking in a group of people and comparing what they heard with a mono recording of that same group of people from the same location. I think that a person would have the same problem with discriminating between different people talking at the same time from the recording as the computer does.
Reply | Report Abuse | Link to thisAt least that has been my experiance.
Joenn
Joenn, you're right my short summary (Solving the Cocktail Party Problem) didn't address this issue. In this longer piece (Audio Alchemy) the researchers spell out that they're looking at the "monoaural" case where there's only one microphone and thus no directional information about the talkers.
Reply | Report Abuse | Link to thisThere is a new separation "challenge" on at the moment that focuses on the task of extracting speech from recordings made "in a noisy living room" using two microphones: http://www.dcs.shef.ac.uk/spandh/chime/challenge
Results are going to be discussed at a conference in September.
As a native Japanese/English bilingual speaker, I've always thought it was easier to follow multiple overlapping conversations when they were spoken in different languages. I don't know whether this is because the spectral properties of the languages are different, or because I'm using different parts of my brain to understand the different languages simultaneously, but I would be interested to find out.
Reply | Report Abuse | Link to thisThat's an interesting comment. In our work we have found that differences such as gender, pitch and volume made it easier for our algorithm to separate the audio. Probably this is not a property of the algorithm, but rather of how the problem becomes easier the more different the signals are. If two twins are talking at the same time and volume and pause at the same time the computer can't possibly know who continues what -- short of using some language information.
Reply | Report Abuse | Link to thisExcellent article...very much enjoyed...
Reply | Report Abuse | Link to thisI am extremely hard of hearing. I use hearing aids and must give an external FM transmitter to a close companion in order to have a conversation with that companion in a crowded dinning room. My wife on the other hand has excellent hearing and discrimination. I am always astounded at her ability to discriminate and "listen in" to someone speaking several tables distant.
Reply | Report Abuse | Link to thisI especially have trouble hearing, and maybe even thinking, in a cocktail party situation. I wear two hearing aids. It's always seemed like if they could communicate with each other, they could recognise what sounds are in phase, therefore coming from directly in front of me, and amplify those sounds over others. These days they should be able to build the needed computer systems into the frames of my glasses.
Reply | Report Abuse | Link to thisI experienced (total) sudden hearing loss in one ear 2 years ago. I can now no longer "place" sounds: I can't tell where a noise is coming from. And in noisy environments, conversation is almost impossible. In noisy restaurants, even if I'm directly looking at the person speaking, I have great difficulty understanding what is said. My thoughts: with 2 ears, our brain constructs an "aural dome" of our environment, giving us a lot of information about our surroundings. Without these directional cues, it's all just noise -- which is my situation now. Seems to me solving the cocktail party problem could be vastly improved by a 2-microphone (i.e., stereo) recording. And breaking up the noise by separating it into tracks from discrete directions would seem to go a long way toward splitting the "noise" into multiple speech sources.
Reply | Report Abuse | Link to this