10. The bottom-up approach
An alternative to modeling multi-talker speech completely is to analyze the combined sound from the bottom up, finding regions of the spectrogram that were produced by the same speaker. This approach, known as computational auditory scene analysis (CASA), uses features of the spectrogram to identify which sounds should be grouped together. For instance, two sounds that start at the same instant in time probably come from a single person, even if they span several different frequency bands. Similarly, harmonics with the same fundamental frequency (the same “pitch”) most likely come from the same speaker.
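As a toy illustration of grouping by common onset (the onset times below are made up for the example, not measured from real audio), frequency bands whose energy starts at nearly the same moment can be clustered together as likely coming from one source:

```python
# Hypothetical onset times (in seconds) detected in eight frequency bands.
onsets = [0.10, 0.10, 0.35, 0.11, 0.35, 0.36, 0.10, 0.34]

# Group bands whose energy starts within 20 ms of one another:
# a common onset suggests a common source.
groups = {}
for band, t in enumerate(onsets):
    for key in groups:
        if abs(t - key) < 0.02:
            groups[key].append(band)
            break
    else:
        groups[t] = [band]

print(groups)  # two groups: bands starting near 0.10 s and near 0.35 s
```

A real CASA system would combine several such cues (onset, pitch, continuity) rather than relying on onset alone.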
Such methods are often combined with missing feature approaches, which use a single-talker model to recognize the speech based on regions of the spectrogram that are dominated by the target talker, while ignoring areas dominated by other sounds. Missing feature approaches can also be used to help piece together the regions identified by bottom-up methods.
Using single-talker recognition in this way avoids the problem of searching all combinations of speech sounds. The sparseness of speech makes this kind of approach practical: even when two or three people are talking at once, in many small regions of a spectrogram the sound is dominated by just one of the talkers.
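The claim that most spectrogram cells are dominated by one talker can be checked numerically. On a log-power (decibel) scale, a mixture of two sounds is close to whichever source is louder in each cell, a fact often called the log-max approximation; the random spectrograms below are stand-ins for real speech:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical log-power spectrograms for two talkers, in dB
# (rows = frequency bands, columns = time frames).
a = rng.normal(0.0, 10.0, size=(64, 100))
b = rng.normal(0.0, 10.0, size=(64, 100))

# Powers add when the sounds mix; convert the sum back to dB.
mix = 10.0 * np.log10(10.0 ** (a / 10.0) + 10.0 ** (b / 10.0))

# In each cell the mixture is close to whichever talker is louder,
# so most cells are effectively dominated by a single talker.
error = mix - np.maximum(a, b)
print(error.max())            # never exceeds 10*log10(2), about 3.01 dB
print((error < 1.0).mean())   # fraction of cells within 1 dB of the louder talker
```

The error is largest (about 3 dB) only where the two talkers are equally loud; wherever one talker is several decibels louder, the mixture is essentially that talker alone.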
An advantage of bottom-up and missing feature approaches is that they make fewer assumptions about the sounds that may be masking the target speaker, and so they may generalize better to different kinds of noisy environments.
The main limitation of bottom-up approaches is that the regions dominated by one speaker are rarely simple, because overlapping speech signals are generally highly intertwined over frequency and time. What is lost is the ability to use knowledge about the speech patterns of both the target and the background speakers to help disentangle sounds that have been interleaved in this way.
11. Loopy belief propagation
Our algorithm, like those of many groups, uses the top-down approach, which tries to model the full sound from multiple talkers without first picking out special regions in the spectrogram. A variation of belief propagation called loopy belief propagation plays an important role in speeding up the search through all the possibilities. A method for solving sudoku puzzles illustrates how loopy belief propagation works.
To keep things simple, consider a 4×4 sudoku instead of the usual 9×9 version.
The goal of sudoku is to find numbers for all the empty squares so that each row, column and 2×2 sub-box contains all the numbers from 1 to 4. Systematically trying out every possible combination would be very slow, similar to doing the full Viterbi algorithm for multiple talkers. (We are thinking of a person solving the puzzle with pencil and paper. A computer search through all possibilities is actually not excessive even with 9×9 puzzles.)
Instead the requirements for rows, columns and sub-boxes can be considered one at a time, updating the beliefs about the possible numbers in each square at each step. We start with all the possibilities written in the unsolved squares.
To enforce the row constraints, we cross out those numbers that match the known numbers in each row. One way to think of this step is that the squares with known numbers send out messages—“you cannot be my number”—to all the other squares in the same row.
Next we do the same for the columns.
At this point we have solved some of the squares, and we update the grid accordingly.
Then we proceed to send out box messages, eliminating numbers that match known numbers in each of the 2×2 sub-boxes.
Again we update our beliefs about solved numbers.
The problem is not completely solved after imposing all these requirements once, so we go back and consider them in sequence again in the light of our updated beliefs. Sending out row messages again eliminates new numbers because of information gained in the column and sub-box steps.
This simple example is now solved.
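The elimination loop just described can be sketched in a few lines of code (the puzzle grid here is an illustrative example solvable by these messages alone, not necessarily the one shown in the figures):

```python
# A 4x4 sudoku solved purely by iterated constraint messages: each known
# square tells its row, column and 2x2 sub-box neighbors "you cannot be
# my number," and we repeat until nothing changes.
puzzle = [
    [1, 0, 3, 0],
    [0, 4, 0, 2],
    [0, 1, 4, 0],
    [4, 0, 0, 1],
]

# Start with every possibility written in each unsolved square.
cands = [[{v} if v else {1, 2, 3, 4} for v in row] for row in puzzle]

def neighbours(r, c):
    """Squares sharing a row, column or 2x2 sub-box with (r, c)."""
    for rr in range(4):
        for cc in range(4):
            same_box = (rr // 2, cc // 2) == (r // 2, c // 2)
            if (rr, cc) != (r, c) and (rr == r or cc == c or same_box):
                yield rr, cc

changed = True
while changed:
    changed = False
    for r in range(4):
        for c in range(4):
            if len(cands[r][c]) == 1:            # a solved square...
                v = next(iter(cands[r][c]))
                for rr, cc in neighbours(r, c):  # ...sends its messages
                    if v in cands[rr][cc]:
                        cands[rr][cc].discard(v)
                        changed = True

solution = [[next(iter(s)) for s in row] for row in cands]
for row in solution:
    print(row)
```

Each sweep of the `while` loop corresponds to one round of row, column and sub-box messages in the walkthrough above; for this puzzle the candidate sets all shrink to single numbers after a couple of rounds.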
The general method is called loopy belief propagation because the requirements (rows, columns, sub-boxes) are intertwined: each square takes part in several constraints at once, so messages travel around loops and must be passed repeatedly until the beliefs settle down.
Applying the constraints iteratively greatly simplified the search for the solution. The row constraints, for instance, were independent of one another, so the row messages from each known number could be handled independently.
The algorithm described here cannot solve every sudoku puzzle, but only slightly more complicated iterative methods succeed on all of them. Applied to more complicated problems, however, iterative approaches often find only an approximation to the best solution.
12. Loopy belief propagation for superhuman two-speaker separation
This iterative approach can be applied to speech separation by alternating between the different talkers. Consider the case of two people, talker A and talker B. We form an initial estimate of the likely speech of each talker. This estimate may be far from the best solution because we do not try to consider every possibility (that would take too long). Now we hold our estimate of talker B’s speech fixed and use the Viterbi algorithm to find the best estimate of talker A. Then we hold that revised estimate for talker A fixed and search for a better estimate of talker B’s speech. The process continues, alternating between talker A and talker B until our estimates stop changing. For three or more speakers, we rotate through them sequentially.
Even if this search takes many iterations, the process is much faster than evaluating both speakers at once with a multi-talker model. For example, if there are six speakers, each with 10,000 sounds, each iteration must evaluate 60,000 sound combinations—a lot less than the trillion trillion combinations in the full six-talker model!
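The arithmetic behind that comparison is easy to verify:

```python
sounds_per_speaker = 10_000
speakers = 6

full_search = sounds_per_speaker ** speakers   # every joint combination at once
per_iteration = speakers * sounds_per_speaker  # one speaker at a time, others fixed

print(f"{full_search:.0e}")   # 1e+24: a trillion trillion combinations
print(per_iteration)          # 60000
```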
A common metaphor for these kinds of searches is a landscape. The correct solution to the problem is at the bottom of the lowest valley. An exhaustive search would examine the height of every point on the landscape, but that would take too long. The iterative approach involves looking along a line running north-south through the current best choice, and moving to the lowest point along that line. Then repeat with the line running east-west through the new point.
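The line-by-line search can be sketched on a toy landscape (the quadratic bowl below is an arbitrary stand-in for a real speech likelihood surface, and the two coordinates stand in for the two talkers' estimates):

```python
import numpy as np

# A hypothetical "landscape": height as a function of two coordinates,
# with its lowest valley at (3, -1).
def height(x, y):
    return (x - 3.0) ** 2 + (y + 1.0) ** 2 + 0.5 * (x - 3.0) * (y + 1.0)

grid = np.linspace(-10.0, 10.0, 2001)   # candidate points along each line

x, y = 0.0, 0.0                         # initial guess
for _ in range(20):
    x = grid[np.argmin(height(grid, y))]   # best point on the north-south line
    y = grid[np.argmin(height(x, grid))]   # best point on the east-west line

print(round(x, 2), round(y, 2))         # converges near the valley at (3, -1)
```

Each pass examines only the points along two lines rather than the whole plane, which is the same economy the alternating talker-A/talker-B search achieves.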
Iterating between the speakers in this fashion is fast and quite effective. In the speech separation challenge this approach outperformed most other algorithms, but it still fell short of human performance.
A potential pitfall comes up if the search gets stuck in a valley that is deeper than anything lying due north, south, east or west, but the deepest valley (the correct answer) is out of reach on the other side of a big ridge. Which valley the search ends in may vary depending on where the search happened to start out.
This problem can be alleviated by keeping track of multiple locations on the landscape, with an estimated probability for each. Loopy belief propagation updates and propagates these probabilities through each iteration.
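A minimal sketch of the multiple-hypothesis idea, on a toy landscape with a decoy valley (the landscape, starting points and weighting scheme here are all illustrative stand-ins, not the actual message-passing algorithm):

```python
import numpy as np

# A toy landscape with two valleys: the deepest one (the right answer)
# and a shallower decoy that can trap a single line-by-line search.
def height(x, y):
    return min((x - 3.0) ** 2 + (y + 1.0) ** 2,        # deepest valley at (3, -1)
               (x + 4.0) ** 2 + (y - 4.0) ** 2 + 2.0)  # shallower decoy at (-4, 4)

grid = np.linspace(-10.0, 10.0, 401)

# Track several locations at once instead of committing to one.
points = [(-8.0, 8.0), (0.0, 5.0), (5.0, 3.0), (-2.0, -6.0), (7.0, 0.0)]
for _ in range(10):
    new_points = []
    for x, y in points:
        x = grid[np.argmin([height(g, y) for g in grid])]   # north-south line
        y = grid[np.argmin([height(x, g) for g in grid])]   # east-west line
        new_points.append((x, y))
    points = new_points

# Weight each surviving hypothesis by how deep it ended up, a crude
# stand-in for the probabilities loopy belief propagation carries along.
weights = np.exp([-height(x, y) for x, y in points])
weights = weights / weights.sum()
best = points[int(np.argmax(weights))]
print(best)
```

Starts on the wrong side of the ridge settle in the decoy valley, but because several hypotheses are tracked with weights, the deepest valley still wins.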
We developed an algorithm that uses loopy belief propagation in this fashion to decode what each person is saying iteratively, without having to search through all the possible combinations of everyone’s words. The technique speeds up only part of the speech separation problem, however, because it relies on having estimates of the most likely words the speakers are saying. Obtaining those estimates still requires searching through a vast number of possible sound combinations that the talkers could have produced. Loopy belief propagation makes the separation faster than the brute-force multi-speaker search and far more accurate than simple iteration between speakers, but the method would still consume ridiculous amounts of computing time for three or more speakers.
A final innovation is required to control this combinatorial explosion.