4. Beads-on-a-string model of speech
Computers encode the structure of speech in models, which typically represent thousands of words, each word built from a multitude of possible sounds. The most commonly used speech model is the so-called beads-on-a-string model.
The pronunciation of a word is represented as a series of basic sound units called phonemes. For instance, “white” can be represented by the phoneme sequence “W-AY-T” (“AY” is used to represent the sound of the letter “i” in “white”). Each phoneme is further divided into a sequence of states—the “beads”—which represent how the sound power spectrum changes over the duration of a phoneme.

The duration of each phoneme state will vary depending on the speaker and the manner of speech. Speech models encode this variability in duration using probabilities: If the person is saying “white” and the sound in one time frame is phoneme state AY-1, the model specifies the probability of the sound switching to state AY-2 in the next frame.
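As a rough illustration of how a single probability encodes duration, the sketch below assumes a made-up value of 0.8 for the chance that state AY-1 is followed by AY-1 again in the next frame (the article gives no actual numbers). Under that assumption the time spent in the state follows a geometric distribution with a mean of 1 / (1 - 0.8) = 5 frames. The function names are illustrative only.

```python
import random

# Hypothetical value, not from the article: chance that AY-1 is followed
# by AY-1 again in the next frame. The remainder is the chance of AY-2.
STAY_PROB = 0.8
ADVANCE_PROB = 1.0 - STAY_PROB


def sample_duration(stay_prob, rng):
    """Count how many frames the model spends in a state before advancing."""
    frames = 1
    while rng.random() < stay_prob:
        frames += 1
    return frames


def expected_duration(stay_prob):
    """Mean of the resulting geometric duration distribution, in frames."""
    return 1.0 / (1.0 - stay_prob)


rng = random.Random(0)
samples = [sample_duration(STAY_PROB, rng) for _ in range(20000)]
print("average sampled duration:", sum(samples) / len(samples))
print("expected duration:", expected_duration(STAY_PROB))
```

A slow talker would correspond to a higher stay probability and a fast talker to a lower one; the model itself does not need to change shape.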
Probabilities also enter the picture because each phoneme state can correspond to many different possible spectra in a frame of the spectrogram, again depending on all the different ways that someone could produce the sound—tonal inflections, the stress on the syllable, the overall pitch and timbre of the voice and so on.
5. Hidden Markov model
The full speech recognition model also incorporates the probabilities of different sequences of words. In the test speech samples, these probabilities are very simple—after the word “place,” for instance, the words “red,” “green,” “white” and “blue” each have a probability of 25 percent, and all other words have zero probability. Things get much more complicated in real-world speech recognition.
This structure forms what is called the language model, or the task grammar.
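The one rule spelled out above, that "place" is followed by one of four color words with equal probability, can be sketched as a lookup table. This is only a minimal illustration: the table name and function are hypothetical, and the rest of the test grammar is not reproduced here because the article does not list it.

```python
# Next-word probabilities after "place", as stated in the text.
# Any word not listed has zero probability.
NEXT_WORD = {
    "place": {"red": 0.25, "green": 0.25, "white": 0.25, "blue": 0.25},
}


def next_word_prob(previous_word, word):
    """Probability that `word` follows `previous_word` under the toy grammar."""
    return NEXT_WORD.get(previous_word, {}).get(word, 0.0)


print(next_word_prob("place", "white"))
print(next_word_prob("place", "now"))
```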

When the language model is combined with the beads-on-a-string model of phonemes, the result is a model of how the phoneme states may change in the course of a sentence, including silences that may or may not occur between words.

The full structure, including the probabilities for the sound spectra expected to appear in a spectrogram frame when the person is speaking a specific phoneme state, is known as a hidden Markov model, named after the Russian mathematician Andrey Andreyevich Markov (1856–1922).
A “Markov process” or “Markov chain” is a sequence of random states in which the probability of what comes at the next time step depends only on the current state and not on anything earlier. Thus the model only cares that the current state is phoneme state AY-1 in the word “white.” It doesn’t care if the previous state was W-3 or AY-1.
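The Markov property can be made concrete with a small sketch. The transition probabilities below are invented for illustration (the article gives none), and the state list covers only a few phoneme states of "white". The point is that the function choosing the next state looks only at the last entry in the history.

```python
import random

# Hypothetical transition probabilities for a few phoneme states of "white".
TRANSITIONS = {
    "W-3": {"W-3": 0.6, "AY-1": 0.4},
    "AY-1": {"AY-1": 0.7, "AY-2": 0.3},
    "AY-2": {"AY-2": 0.7, "AY-3": 0.3},
    "AY-3": {"AY-3": 1.0},
}


def next_state_distribution(history):
    """Markov property: only the most recent state in the history matters."""
    return TRANSITIONS[history[-1]]


def sample_chain(start, steps, rng):
    """Sample a state sequence of length steps + 1, beginning at `start`."""
    states = [start]
    for _ in range(steps):
        dist = next_state_distribution(states)
        names = list(dist)
        weights = [dist[name] for name in names]
        states.append(rng.choices(names, weights=weights)[0])
    return states


# The distribution after AY-1 is the same whether we arrived from W-3 or AY-1.
print(next_state_distribution(["W-3", "AY-1"]))
print(next_state_distribution(["AY-1", "AY-1"]))
print(sample_chain("W-3", 10, random.Random(1)))
```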
The Markov model is “hidden” because all that the computer gets to see directly are the sound spectra. The computer cannot see the phoneme state AY-1 directly—it can only estimate the probability of state AY-1 based on the observed sound spectra.
Researchers at IBM pioneered the use of hidden Markov models for speech recognition in the early 1970s. The recognition algorithm searches for the sequence of states in the speech model that best explains the observed sounds. Carrying out that search efficiently involves a process called belief propagation.
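One standard way to find the single most probable state sequence in a hidden Markov model is the Viterbi algorithm, a dynamic-programming method closely related to belief propagation on a chain. The article does not say which exact variant the researchers used, so the sketch below is only an illustration. The two states, the observation symbols "a" and "b" (standing in for two kinds of sound spectra) and all probabilities are made up.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden state sequence for the observations."""
    # best[t][s] = probability of the best path that ends in state s at time t
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, prob = max(
                ((p, best[t - 1][p] * trans_p[p].get(s, 0.0)) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = prob * emit_p[s][observations[t]]
            back[t][s] = prev
    # Trace back from the most probable final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Toy model with hypothetical numbers.
STATES = ["AY-1", "AY-2"]
START = {"AY-1": 1.0, "AY-2": 0.0}
TRANS = {"AY-1": {"AY-1": 0.7, "AY-2": 0.3}, "AY-2": {"AY-2": 1.0}}
EMIT = {"AY-1": {"a": 0.9, "b": 0.1}, "AY-2": {"a": 0.2, "b": 0.8}}

print(viterbi(["a", "a", "b", "b"], STATES, START, TRANS, EMIT))
# prints ['AY-1', 'AY-1', 'AY-2', 'AY-2']
```

The table `best` is what keeps the search efficient: at each frame it stores one number per state instead of one per possible path.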
6. A dicey game: Belief propagation
A dice game illustrates how belief propagation works.
The game involves two dice. One is a fair die that is equally likely to land on each number from 1 to 6. The other die is loaded so that it always lands on the same number, but we do not know what that number is. The dice are rolled repeatedly, and we are told only their sum each time. Our task is to determine the numbers that the dice showed.
Before the first roll, we know nothing about which number the loaded die always shows. We believe each number to be equally likely. We can represent our beliefs for the two dice with a grid colored to show the uniform probabilities for each combination of numbers.

The first roll has a sum of 9. We know there are four ways to get that total with the two dice: 3+6, 4+5, 5+4, or 6+3. We now know the loaded die could not be a 1 or 2.

We update our belief about the loaded die: 1 and 2 have zero probability, and 3, 4, 5 and 6 each have 25 percent probability. We can propagate this belief forward to the next roll.
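The update after the first roll is an application of Bayes' rule, and can be sketched as follows. The function name is illustrative, and the code assumes the observed sum is always achievable by at least one remaining candidate (otherwise the normalization would divide by zero).

```python
from fractions import Fraction

FACES = range(1, 7)


def update_belief(belief, total):
    """Bayes update of the belief over the loaded die after seeing one sum."""
    # The fair die shows each face with probability 1/6, so the likelihood of
    # the sum is 1/6 if (total - loaded) is a valid face, and 0 otherwise.
    unnormalized = {
        face: belief[face] * (Fraction(1, 6) if (total - face) in FACES else 0)
        for face in FACES
    }
    norm = sum(unnormalized.values())
    return {face: p / norm for face, p in unnormalized.items()}


prior = {face: Fraction(1, 6) for face in FACES}  # uniform belief before any roll
after_first = update_belief(prior, 9)             # first roll sums to 9
for face in FACES:
    print(face, after_first[face])
```

Feeding `after_first` back into `update_belief` with the next sum is what "propagating the belief forward" amounts to in this game.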

The second roll has a sum of 8. We learn nothing new about the loaded die because all the possibilities (3, 4, 5, 6) can combine with a fair die to make 8. We propagate our belief about the loaded die to the next roll.

The third roll has a sum of 5. We now know the loaded die must be either 3 or 4 because 5 and 6 are too big.

Again we update our beliefs accordingly and propagate them forward to the next dice roll.

The fourth roll has a sum of 10. We now know the loaded die must be 4 because the only other remaining possibility, 3, is too small.

As well as updating our beliefs to reflect this certainty about the loaded die, we can now propagate our beliefs backward and infer the number on the fair die in each of the rolls.
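The whole game can be replayed in a few lines. In this sketch the forward pass tracks only the set of loaded-die values still possible; that shortcut is an assumption that works here because every remaining candidate stays equally likely in this particular game. The backward pass then subtracts the loaded die's value from each reported sum. Function names are illustrative.

```python
FACES = range(1, 7)
SUMS = [9, 8, 5, 10]  # the four sums reported in the text


def filter_loaded(sums):
    """Forward pass: loaded-die values still consistent after each roll."""
    possible = set(FACES)
    history = []
    for total in sums:
        # Keep a candidate only if the fair die could supply the difference.
        possible = {face for face in possible if (total - face) in FACES}
        history.append(sorted(possible))
    return history


def infer_fair(sums, loaded):
    """Backward pass: once the loaded die is known, each fair roll follows."""
    return [total - loaded for total in sums]


history = filter_loaded(SUMS)
print(history)                          # [[3, 4, 5, 6], [3, 4, 5, 6], [3, 4], [4]]
loaded = history[-1][0]
print(loaded, infer_fair(SUMS, loaded))  # 4 [5, 4, 1, 6]
```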

As well as illustrating the general idea of belief propagation, this example has also shown how the technique can separate two “signals” (the numbers on the dice) given some knowledge about how the signals behave over time (in this case, that the loaded die never changes).







8 Comments
One thing the article didn't talk about (or I didn't catch it) was that we humans hear in stereo. Or at least those of us who have two good ears do. I don't, so I have a very hard time distinguishing between different voices, the same as computers seem to. Humans with stereo sound capability can lock in on a person's voice in 3D space and filter out anything outside of that space. A good way to test that is to have someone listen to people talking in a group and compare what they heard with a mono recording of that same group from the same location. I think that a person would have the same problem discriminating between different people talking at the same time from the recording as the computer does.
At least that has been my experience.
Joenn
Joenn, you're right: my short summary (Solving the Cocktail Party Problem) didn't address this issue. In this longer piece (Audio Alchemy) the researchers spell out that they're looking at the "monoaural" case where there's only one microphone and thus no directional information about the talkers.
There is a new separation "challenge" on at the moment that focuses on the task of extracting speech from recordings made "in a noisy living room" using two microphones: http://www.dcs.shef.ac.uk/spandh/chime/challenge
Results are going to be discussed at a conference in September.
As a native Japanese/English bilingual speaker, I've always thought it was easier to follow multiple overlapping conversations when they were spoken in different languages. I don't know whether this is because the spectral properties of the languages are different, or because I'm using different parts of my brain to understand the different languages simultaneously, but I would be interested to find out.
That's an interesting comment. In our work we have found that differences such as gender, pitch and volume made it easier for our algorithm to separate the audio. Probably this is not a property of the algorithm, but rather of how the problem becomes easier the more different the signals are. If two twins are talking at the same time and volume and pause at the same time the computer can't possibly know who continues what -- short of using some language information.
Excellent article...very much enjoyed...
I am extremely hard of hearing. I use hearing aids and must give an external FM transmitter to a close companion in order to have a conversation with that companion in a crowded dining room. My wife on the other hand has excellent hearing and discrimination. I am always astounded at her ability to discriminate and "listen in" to someone speaking several tables distant.
I especially have trouble hearing, and maybe even thinking, in a cocktail party situation. I wear two hearing aids. It's always seemed like if they could communicate with each other, they could recognise what sounds are in phase, therefore coming from directly in front of me, and amplify those sounds over others. These days they should be able to build the needed computer systems into the frames of my glasses.
I experienced (total) sudden hearing loss in one ear 2 years ago. I can now no longer "place" sounds: I can't tell where a noise is coming from. And in noisy environments, conversation is almost impossible. In noisy restaurants, even if I'm directly looking at the person speaking, I have great difficulty understanding what is said. My thoughts: with 2 ears, our brain constructs an "aural dome" of our environment, giving us a lot of information about our surroundings. Without these directional cues, it's all just noise -- which is my situation now. Seems to me solving the cocktail party problem could be vastly improved by a 2-microphone (i.e., stereo) recording. And breaking up the noise by separating it into tracks from discrete directions would seem to go a long way toward splitting the "noise" into multiple speech sources.