7. The Viterbi algorithm—belief propagation for single talker speech recognition
Belief propagation for recognition of a single talker’s speech follows a similar general process, just with much more complicated probabilities. Each successive frame of a spectrogram is a new roll of the die. We carry forward beliefs built up from the earlier rolls. A procedure called the Viterbi algorithm carries out this search for the best explanation of the observed sound.
For instance, the sound in the frames so far may have been consistent with either R-EH or W-AY, with the former being more likely. We could represent the possible sequences of phoneme states and their probabilities in a diagram, with looping arrows to indicate states that have repeated across several frames of the spectrogram:
(We are simplifying things by not dividing the phonemes into states, such as R-1, R-2, R-3.) Suppose that the spectrum in the next frame has a high probability if the phoneme is T, a moderate probability if it is B, and a very low probability if it is D.
Our speech model has no words with the sounds W-AY-B or R-EH-B, but it does have “white” and “red.” Our updated beliefs give W-AY-T a high probability and R-EH-D a lower probability.
Suppose our language model also included W-AY-D (as in the word “wide”), and suppose that, given the audio observed in the frames so far, we estimated the speech W-AY-D to be even less probable than R-EH-D. The Viterbi algorithm, however, keeps only the best sequence ending in each possible current phoneme state: in this case, R-EH-D for D and W-AY-T for T. This trimming of possibilities is what makes the Viterbi algorithm an efficient way to search for the most likely speech.
Thus, we safely discard W-AY-D from further consideration and carry forward our beliefs about R-EH-D and W-AY-T to the next spectrogram frame.
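To make the bookkeeping concrete, here is a minimal sketch of the Viterbi algorithm in Python. The phoneme states, transition structure, and per-frame acoustic scores are invented for this “white” versus “red” example; a real recognizer would derive them from trained acoustic and language models.

```python
import math

def viterbi(frames, states, log_start, log_trans, log_emit):
    """Return the most probable state sequence for the observed frames.
    At every frame, only the best-scoring path ending in each state is
    kept; all other paths into that state are discarded."""
    best = {s: log_start.get(s, -math.inf) + log_emit(s, frames[0]) for s in states}
    paths = {s: [s] for s in states}
    for obs in frames[1:]:
        new_best, new_paths = {}, {}
        for s in states:
            # Of all paths that could move into state s, keep only the best.
            prev = max(states, key=lambda p: best[p] + log_trans.get((p, s), -math.inf))
            new_best[s] = best[prev] + log_trans.get((prev, s), -math.inf) + log_emit(s, obs)
            new_paths[s] = paths[prev] + [s]
        best, paths = new_best, new_paths
    winner = max(best, key=best.get)
    return paths[winner], best[winner]

# Hypothetical setup: "red" (R-EH-D), "white" (W-AY-T) and "wide"
# (W-AY-D) are the only words this toy model knows.
states = ["R", "EH", "W", "AY", "T", "D"]
log_start = {"R": 0.0, "W": 0.0}
log_trans = {("R", "EH"): 0.0, ("W", "AY"): 0.0,
             ("EH", "D"): 0.0, ("AY", "T"): 0.0, ("AY", "D"): 0.0}
log_emit = lambda s, obs: obs.get(s, -math.inf)

# Invented per-frame acoustic log scores: the first two frames slightly
# favor R-EH over W-AY, but the final frame strongly favors T over D.
frames = [{"R": -0.5, "W": -0.7},
          {"EH": -0.6, "AY": -0.6},
          {"T": -0.3, "D": -2.0}]

path, score = viterbi(frames, states, log_start, log_trans, log_emit)
print(path)  # ['W', 'AY', 'T']
```

At the final frame the algorithm keeps R-EH-D as the single best path ending in D and silently drops W-AY-D, exactly the pruning described above, before W-AY-T emerges as the overall winner.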
8. Models for overlapping speech
When two words are spoken at once, the phonemes in each word may align in many possible ways. A speech separation algorithm has to find the most likely pair of sound signals to explain the full signal.
For instance, consider the case of two people saying the words “three” and “six.” The phonemes may happen to align in this fashion:
A brute-force approach to analyzing this overlapping speech is to combine two single-talker speech models to form a two-speaker model. Each state in this combined model consists of a pair of phoneme states, one for each talker. The overlapping words would be represented like this:
The arrows indicate the possible transitions among the two-phoneme states. Each transition has an associated probability. The heavier blue arrows show the actual sequence of states in the example.
The probabilities for sequences of these two-speaker states can be computed from the probabilities in the single speaker models and from an “interaction function” that describes how the individual audio signals combine. The Viterbi algorithm can search for the most probable sequence of two-speaker states to explain the total audio signal.
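The construction of the two-speaker model can be sketched in a few lines. This toy code pairs up the states of two copies of a single-talker model and adds their log transition probabilities, assuming the talkers move through their states independently; the interaction function that combines the two audio signals is left as an abstract argument, since its form depends on how the signals are modeled to mix. All names here are illustrative.

```python
import itertools
import math

def combine_models(states, log_trans):
    """Build the two-speaker model from a single-talker model.
    Combined states are pairs of phoneme states; combined transition
    log probabilities are sums of the single-talker ones."""
    pair_states = list(itertools.product(states, states))
    pair_trans = {}
    for (a, b) in pair_states:
        for (c, d) in pair_states:
            pair_trans[((a, b), (c, d))] = (log_trans.get((a, c), -math.inf)
                                            + log_trans.get((b, d), -math.inf))
    return pair_states, pair_trans

def pair_log_emit(log_emit, interaction):
    """Combined emission score: how well the mixed audio in one frame
    is explained by a pair of phonemes, via an interaction function
    describing how the two audio signals combine."""
    return lambda pair, obs: interaction(log_emit(pair[0], obs), log_emit(pair[1], obs))

# With S single-talker states there are S**2 pair states, and the
# Viterbi search must consider up to S**4 transitions per frame.
states = ["TH", "R", "IY", "S", "IH", "K"]   # phonemes of "three" and "six"
pairs, trans = combine_models(states, {("TH", "R"): math.log(0.5)})
print(len(pairs), len(trans))  # 36 1296
```

Even in this six-phoneme toy the state count is squared and the transition count quartic, which foreshadows the combinatorial problem of adding more talkers.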
This approach is feasible for combining two small speech models, such as those for the test samples, in which two people spoke from a very limited vocabulary. Our algorithm used this approach in the speech separation challenge and performed better than human listeners, making it the first speech separation algorithm to do so. Such superhuman performance by a computer is an extremely rare accomplishment in any field of audio or visual perception.
Yet a critically important challenge remained: to find an algorithm that would scale up and remain efficient when dealing with larger speech models and more talkers.
9. The combinatorial problem
Trying to analyze overlapping speech by using a brute-force combination of speaker models becomes exponentially more time-consuming as the number of talkers increases. The graph shows the time it would take a modern PC to analyze a single second of sound produced by a number of people speaking at once. Here we assume each second of audio is divided into 50 spectrogram frames, and that an individual’s speech in any given spectrogram frame consists of one of 10,000 possible sounds. The n-talker model must therefore handle 10,000ⁿ possible sound combinations in each frame. We also assume the PC can analyze a million sound combinations per second.
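Under the stated assumptions the running time is easy to tabulate; only the constants below come from the text, and the function name is ours.

```python
# Brute-force cost under the article's assumptions: 50 spectrogram
# frames per second of audio, 10,000 possible sounds per talker per
# frame, and a PC that checks a million sound combinations per second.
FRAMES_PER_SECOND = 50
SOUNDS_PER_TALKER = 10_000
CHECKS_PER_SECOND = 1_000_000

def seconds_to_analyze_one_second(n_talkers):
    combinations_per_frame = SOUNDS_PER_TALKER ** n_talkers
    return FRAMES_PER_SECOND * combinations_per_frame / CHECKS_PER_SECOND

for n in range(1, 4):
    print(n, seconds_to_analyze_one_second(n))
# 1 talker:  0.5 seconds
# 2 talkers: 5,000 seconds (over an hour)
# 3 talkers: 50 million seconds (about a year and a half)
```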
Clearly this brute-force approach is hopeless.