7. The Viterbi algorithm—belief propagation for single talker speech recognition
Belief propagation for recognition of a single talker’s speech follows a similar general process, just with much more complicated probabilities. Each successive frame of a spectrogram is a new roll of the die. We carry forward beliefs built up from the earlier rolls. A procedure called the Viterbi algorithm carries out this search for the best explanation of the observed sound.
For instance, the sound in the frames so far may have been consistent with either R-EH or W-AY, with the former being more likely. We could represent the possible sequences of phoneme states and their probabilities in a diagram, with looping arrows to indicate states that have repeated across several frames of the spectrogram:

(We are simplifying things by not dividing the phonemes into states, such as R-1, R-2, R-3.) Suppose that the spectrum in the next frame has a high probability if the phoneme is T, a moderate probability if it is B and a very low probability if it is D.

Our speech model has no words with the sounds W-AY-B or R-EH-B, but it does have “white” and “red.” Our updated beliefs give W-AY-T a high probability and R-EH-D a lower probability.

Suppose our language model also included W-AY-D (for instance, in the word “wide”), and suppose that we estimated that the speech W-AY-D was even less probable than R-EH-D given the observed audio in the frames so far. The Viterbi algorithm, however, only keeps the best sequence for each possible current phoneme state—in this case R-EH-D for D and W-AY-T for T. This trimming of possibilities makes the Viterbi algorithm an efficient way to search for the most likely speech.
Thus, we safely discard W-AY-D from further consideration and carry forward our beliefs about R-EH-D and W-AY-T to the next spectrogram frame.
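The pruning step can be sketched in a few lines of Python. This is a minimal illustration, not the recognizer described here: the state names follow the example above, but every probability, and the helper names `logs` and `viterbi_survivors`, are made-up assumptions chosen so that W-AY-T comes out most likely and R-EH-D beats W-AY-D.

```python
import math

NEG_INF = float("-inf")

def logs(probs):
    """Convert a dict of probabilities to log probabilities."""
    return {key: math.log(p) for key, p in probs.items()}

def viterbi_survivors(states, log_start, log_trans, log_emit_frames):
    """Return {state: (log_prob, path)} with the best path ending in each state.

    Missing dictionary entries are treated as impossible (log probability -inf).
    """
    best = {
        s: (log_start.get(s, NEG_INF) + log_emit_frames[0].get(s, NEG_INF), [s])
        for s in states
    }
    for frame in log_emit_frames[1:]:
        new_best = {}
        for cur in states:
            # Keep only the single best predecessor path for each current state.
            score, path = max(
                (
                    (best[prev][0] + log_trans.get(prev, {}).get(cur, NEG_INF),
                     best[prev][1])
                    for prev in states
                ),
                key=lambda item: item[0],
            )
            new_best[cur] = (score + frame.get(cur, NEG_INF), path + [cur])
        best = new_best
    return best

# Hypothetical numbers, for illustration only.
states = ["R", "EH", "W", "AY", "T", "D"]
start = logs({"R": 0.5, "W": 0.5})
trans = {
    "R": logs({"R": 0.5, "EH": 0.5}),
    "EH": logs({"EH": 0.5, "D": 0.5}),
    "W": logs({"W": 0.5, "AY": 0.5}),
    "AY": logs({"AY": 0.4, "T": 0.3, "D": 0.3}),
    "T": logs({"T": 1.0}),
    "D": logs({"D": 1.0}),
}
frames = [
    logs({"R": 0.6, "W": 0.4}),    # first frame: R somewhat more likely than W
    logs({"EH": 0.6, "AY": 0.4}),  # second frame: EH somewhat more likely than AY
    logs({"T": 0.7, "D": 0.05}),   # third frame: T likely, D very unlikely
]

survivors = viterbi_survivors(states, start, trans, frames)
best_score, best_path = max(survivors.values(), key=lambda item: item[0])
print("-".join(survivors["D"][1]))  # R-EH-D (W-AY-D was discarded)
print("-".join(best_path))          # W-AY-T
```

Because only one path survives per state, the work in each frame grows with the square of the number of states rather than with the number of possible sequences.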
8. Models for overlapping speech
When two words are spoken at once, the phonemes in each word may align in many possible ways. A speech separation algorithm has to find the most likely pair of sound signals to explain the full signal.
For instance, consider the case of two people saying the words “three” and “six.” The phonemes may happen to align in this fashion:

A brute-force approach to analyzing this overlapping speech is to combine two single-talker speech models to form a two-speaker model. Each state in this model consists of a pair of phoneme states. The overlapping words would be represented like this:

The arrows indicate the possible transitions among the two-phoneme states. Each transition has an associated probability. The heavier blue arrows show the actual sequence of states in the example.
The probabilities for sequences of these two-speaker states can be computed from the probabilities in the single speaker models and from an “interaction function” that describes how the individual audio signals combine. The Viterbi algorithm can search for the most probable sequence of two-speaker states to explain the total audio signal.
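As a rough sketch of how such a two-speaker model could be assembled, the Python below pairs up the states of two toy single-talker models for "three" and "six." Several things here are assumptions rather than details from our system: the talkers are taken to change phonemes independently, so joint transition probabilities are simple products; the 0.5 self-loop probability and the function names are invented; and `max_interaction` is only a common stand-in (the louder talker dominates each frequency bin of the log spectrum), not necessarily the interaction function we used.

```python
from itertools import product

def left_to_right(phonemes, stay=0.5):
    """Toy single-talker model: each phoneme repeats or moves to the next one."""
    trans = {}
    for i, p in enumerate(phonemes):
        if i + 1 < len(phonemes):
            trans[p] = {p: stay, phonemes[i + 1]: 1.0 - stay}
        else:
            trans[p] = {p: 1.0}  # the last phoneme can only repeat
    return trans

def joint_transitions(trans_a, trans_b):
    """Build transition probabilities over pairs of phoneme states.

    Assumes the two talkers move between states independently.
    """
    joint = {}
    for (a0, row_a), (b0, row_b) in product(trans_a.items(), trans_b.items()):
        joint[(a0, b0)] = {
            (a1, b1): pa * pb
            for (a1, pa), (b1, pb) in product(row_a.items(), row_b.items())
        }
    return joint

def max_interaction(log_spec_a, log_spec_b):
    """Assumed interaction function: elementwise maximum of two log spectra."""
    return [max(a, b) for a, b in zip(log_spec_a, log_spec_b)]

three = left_to_right(["TH", "R", "IY"])
six = left_to_right(["S1", "IH", "K", "S2"])  # the two S sounds get separate states
joint = joint_transitions(three, six)

print(len(joint))          # 12 two-phoneme states (3 x 4)
print(joint[("TH", "S1")])  # four possible next pairs, each with probability 0.25
```

The joint table can be searched with the same Viterbi procedure as a single-talker model; the only change is that each state is now a pair.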
This approach is feasible for combining two small speech models, such as with the test samples of two people using a very limited vocabulary. Our algorithm used this approach in the speech separation challenge and performed better than human listeners—the first speech separation algorithm to do so. Such superhuman performance by a computer is an extremely rare accomplishment in any field of audio or visual perception.
Yet a critically important challenge remained: to find an algorithm that would scale up and remain efficient when dealing with larger speech models and more talkers.
9. The combinatorial problem

Trying to analyze overlapping speech by using a brute force combination of speaker models becomes exponentially more time consuming as the number of talkers increases. The graph shows the time it would take a modern PC to analyze a single second of sound produced by a number of people speaking at once. Here we assume each second of audio is divided into 50 spectrogram frames, and an individual’s speech in any given spectrogram frame consists of one of 10,000 possible sounds. The n-talker model must therefore handle 10,000^n possible sound combinations in each frame. We also assume the PC can analyze a million sound combinations per second.
Clearly this brute force approach is hopeless.
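The arithmetic behind the graph can be reproduced directly from the three assumptions stated above. The short Python sketch below uses only those figures; the constant and function names are ours for illustration, and a year is taken as 365.25 days.

```python
FRAMES_PER_SECOND = 50             # spectrogram frames per second of audio
SOUNDS_PER_TALKER = 10_000         # possible sounds per talker per frame
COMBINATIONS_PER_SECOND = 1_000_000  # combinations the PC can analyze per second
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def seconds_to_analyze(n_talkers):
    """Time to analyze one second of audio with a brute-force n-talker model."""
    combinations_per_frame = SOUNDS_PER_TALKER ** n_talkers
    return FRAMES_PER_SECOND * combinations_per_frame / COMBINATIONS_PER_SECOND

for n in range(1, 5):
    s = seconds_to_analyze(n)
    print(f"{n} talker(s): {s:.3g} seconds = {s / SECONDS_PER_YEAR:.3g} years")
```

Under these assumptions, one second of audio takes half a second to analyze for one talker, about 83 minutes for two, about 1.6 years for three and roughly 16,000 years for four.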







8 Comments
One thing the article didn't talk about (or I didn't catch it) was that we humans hear in stereo. Or at least those of us that have 2 good ears do. I don't, so I have a very hard time distinguishing between different voices, the same as computers seem to. Humans with stereo sound capability can lock in on a person's voice in 3D space and filter out anything outside of that space. A good way to test that is to have someone listen to people talking in a group and compare what they heard with a mono recording of that same group from the same location. I think that a person would have the same problem discriminating between different people talking at the same time from the recording as the computer does.
At least that has been my experience.
Joenn
Joenn, you're right, my short summary (Solving the Cocktail Party Problem) didn't address this issue. In this longer piece (Audio Alchemy) the researchers spell out that they're looking at the "monaural" case where there's only one microphone and thus no directional information about the talkers.
There is a new separation "challenge" on at the moment that focuses on the task of extracting speech from recordings made "in a noisy living room" using two microphones: http://www.dcs.shef.ac.uk/spandh/chime/challenge
Results are going to be discussed at a conference in September.
As a native Japanese/English bilingual speaker, I've always thought it was easier to follow multiple overlapping conversations when they were spoken in different languages. I don't know whether this is because the spectral properties of the languages are different, or because I'm using different parts of my brain to understand the different languages simultaneously, but I would be interested to find out.
That's an interesting comment. In our work we have found that differences such as gender, pitch and volume made it easier for our algorithm to separate the audio. Probably this is not a property of the algorithm, but rather of how the problem becomes easier the more different the signals are. If two twins are talking at the same time and volume and pause at the same time the computer can't possibly know who continues what -- short of using some language information.
Excellent article...very much enjoyed...
I am extremely hard of hearing. I use hearing aids and must give an external FM transmitter to a close companion in order to have a conversation with that companion in a crowded dining room. My wife on the other hand has excellent hearing and discrimination. I am always astounded at her ability to discriminate and "listen in" to someone speaking several tables distant.
I especially have trouble hearing, and maybe even thinking, in a cocktail party situation. I wear two hearing aids. It's always seemed like if they could communicate with each other, they could recognise what sounds are in phase, therefore coming from directly in front of me, and amplify those sounds over others. These days they should be able to build the needed computer systems into the frames of my glasses.
I experienced (total) sudden hearing loss in one ear 2 years ago. I can now no longer "place" sounds: I can't tell where a noise is coming from. And in noisy environments, conversation is almost impossible. In noisy restaurants, even if I'm directly looking at the person speaking, I have great difficulty understanding what is said. My thoughts: with 2 ears, our brain constructs an "aural dome" of our environment, giving us a lot of information about our surroundings. Without these directional cues, it's all just noise -- which is my situation now. Seems to me solving the cocktail party problem could be vastly improved by a 2-microphone (i.e., stereo) recording. And breaking up the noise by separating it into tracks from discrete directions would seem to go a long way toward splitting the "noise" into multiple speech sources.