4. Beads-on-a-string model of speech
Computers encode the structure of speech in models, which typically represent thousands of words, each word built from a multitude of possible sounds. The most commonly used speech model is the so-called beads-on-a-string model.
The pronunciation of a word is represented as a series of basic sound units called phonemes. For instance, “white” can be represented by the phoneme sequence “W-AY-T” (“AY” is used to represent the sound of the letter “i” in “white”). Each phoneme is further divided into a sequence of states—the “beads”—which represent how the sound power spectrum changes over the duration of a phoneme.
The duration of each phoneme state will vary depending on the speaker and the manner of speech. Speech models encode this variability in duration using probabilities: If the person is saying “white” and the sound in one time frame is phoneme state AY-1, the model specifies the probability of the sound switching to state AY-2 in the next frame.
Probabilities also enter the picture because each phoneme state can correspond to many different possible spectra in a frame of the spectrogram, again depending on all the different ways that someone could produce the sound—tonal inflections, the stress on the syllable, the overall pitch and timbre of the voice and so on.
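The transition side of this model can be sketched as a small table of state-to-state probabilities. All the numbers below are invented for illustration; real systems estimate them from recorded speech and use richer state inventories.

```python
# Beads-on-a-string transitions for "white" (W-AY-T), with each phoneme
# split into numbered states. Probabilities here are made up.
transitions = {
    "W-1":  {"W-1": 0.6, "W-2": 0.4},     # stay in W-1, or move on
    "W-2":  {"W-2": 0.5, "AY-1": 0.5},
    "AY-1": {"AY-1": 0.7, "AY-2": 0.3},   # e.g. a 30% chance that AY-1
    "AY-2": {"AY-2": 0.6, "T-1": 0.4},    # switches to AY-2 next frame
    "T-1":  {"T-1": 0.3, "end": 0.7},
}

# Each row is a probability distribution over possible next states.
for state, nxt in transitions.items():
    assert abs(sum(nxt.values()) - 1.0) < 1e-9
```

Staying in a state for more frames models a longer-held sound, which is how this structure absorbs differences in speaking rate between talkers.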
5. Hidden Markov model
The full speech recognition model also incorporates the probabilities of different sequences of words. In the test speech samples, these probabilities are very simple—after the word “place,” for instance, the words “red,” “green,” “white” and “blue” each have a probability of 25 percent, and all other words have zero probability. Things get much more complicated in real-world speech recognition.
This structure forms what is called the language model, or the task grammar.
When the language model is combined with the beads-on-a-string model of phonemes, the result is a model of how the phoneme states may change in the course of a sentence, including silences that may or may not occur between words.
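For the test samples described above, the task grammar can be written directly as a table of next-word probabilities. This is a minimal sketch; only the entry after "place" is shown, and the pronunciation table is a single illustrative entry.

```python
# Task grammar for the test samples: after "place", the four color
# words are equally likely (25 percent each); every other word has
# zero probability and is simply left out of the table.
language_model = {
    "place": {"red": 0.25, "green": 0.25, "white": 0.25, "blue": 0.25},
}

# Each word then expands into its beads-on-a-string phoneme sequence,
# with optional silences between words.
pronunciations = {"white": ["W", "AY", "T"]}

print(language_model["place"].get("white", 0.0))   # 0.25
print(language_model["place"].get("seven", 0.0))   # 0.0
```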
The full structure—including the probabilities for the sound spectra expected to appear in a spectrogram frame when the person is speaking a specific phoneme state—is known as a hidden Markov model, named after the 19th-century Russian mathematician Andrey Andreyevich Markov.
A “Markov process” or “Markov chain” is a sequence of random states in which the probability of what comes at the next time step depends only on the current state and not on anything earlier. Thus the model only cares that the current state is phoneme state AY-1 in the word “white.” It doesn’t care if the previous state was W-3 or AY-1.
The Markov model is “hidden” because all that the computer gets to see directly are the sound spectra. The computer cannot see the phoneme state AY-1 directly—it can only estimate the probability of state AY-1 based on the observed sound spectra.
Researchers at IBM pioneered the use of hidden Markov models for speech recognition in the early 1970s. The recognition algorithm searches for the sequence of states in the speech model that best explains the observed sounds. Carrying out that search efficiently involves a process called belief propagation.
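One standard way to carry out that search is the Viterbi algorithm, the max-product form of belief propagation on a chain. The sketch below uses a hypothetical two-state fragment with two coarse "spectrum" symbols standing in for real spectra; all the probabilities are invented.

```python
# Toy Viterbi search over two hidden phoneme states. Real systems use
# many states per phoneme and continuous spectra, not two symbols.
states = ["AY-1", "AY-2"]
start = {"AY-1": 0.9, "AY-2": 0.1}
trans = {"AY-1": {"AY-1": 0.7, "AY-2": 0.3},
         "AY-2": {"AY-1": 0.0, "AY-2": 1.0}}
emit  = {"AY-1": {"low": 0.8, "high": 0.2},   # AY-1 usually emits "low"
         "AY-2": {"low": 0.3, "high": 0.7}}   # AY-2 usually emits "high"

def viterbi(obs):
    """Most probable hidden-state sequence for the observed symbols."""
    # best[s] = (probability, path) of the best path ending in state s
    best = {s: (start[s] * emit[s][obs[0]], [s]) for s in states}
    for o in obs[1:]:
        step = {}
        for s in states:
            p, prev = max((best[q][0] * trans[q][s], q) for q in states)
            step[s] = (p * emit[s][o], best[prev][1] + [s])
        best = step
    return max(best.values())[1]

print(viterbi(["low", "low", "high", "high"]))
# -> ['AY-1', 'AY-1', 'AY-2', 'AY-2']: the best explanation is that
#    the hidden sound moved from AY-1 to AY-2 midway through.
```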
6. A dicey game: Belief propagation
A dice game illustrates how belief propagation works.
The game involves two dice. One is a fair die that is equally likely to land on each number from 1 to 6. The other die is loaded so that it always lands on the same number, but we do not know what that number is. The dice are rolled repeatedly, and we are told only their sum each time. Our task is to determine the numbers that the dice showed.
Before the first roll, we know nothing about which number the loaded die always shows. We believe each number to be equally likely. We can represent our beliefs for the two dice with a grid colored to show the uniform probabilities for each combination of numbers.
The first roll has a sum of 9. We know there are four ways to get that total with the two dice: 3+6, 4+5, 5+4, or 6+3. We now know the loaded die could not be a 1 or 2.
We update our belief about the loaded die: 1 and 2 have zero probability, and 3, 4, 5 and 6 each have 25 percent probability. We can propagate this belief forward to the next roll.
The second roll has a sum of 8. We learn nothing new about the loaded die because all the possibilities (3, 4, 5, 6) can combine with a fair die to make 8. We propagate our belief about the loaded die to the next roll.
The third roll has a sum of 5. We now know the loaded die must be either 3 or 4 because 5 and 6 are too big.
Again we update our beliefs accordingly and propagate them forward to the next dice roll.
The fourth roll has a sum of 10. We now know the loaded die must be 4 because the only other remaining possibility, 3, is too small.
As well as updating our beliefs to reflect this certainty about the loaded die, we can now propagate our beliefs backward and infer the number on the fair die in each of the rolls.
Besides illustrating the general idea of belief propagation, this example shows how the technique can separate two “signals” (the numbers on the dice) given some knowledge about how the signals behave over time (in this case, that the loaded die never changes).
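The whole game can be replayed in a few lines of code. Here the belief about the loaded die is kept as a set of still-possible faces (a uniform distribution over that set), which is all this example needs; a full belief-propagation implementation would carry a probability for each face instead.

```python
# The dice game from above: four rolls, only the sums observed.
sums = [9, 8, 5, 10]

possible = set(range(1, 7))   # prior belief: the loaded die could be any face
for s in sums:
    # A loaded face d is consistent with sum s only if the fair die
    # could have shown s - d, i.e. 1 <= s - d <= 6.
    possible = {d for d in possible if 1 <= s - d <= 6}
    print("after sum", s, "loaded die could be", sorted(possible))

loaded = possible.pop()             # a single face remains
fair = [s - loaded for s in sums]   # propagate backward to the fair die
print("loaded die:", loaded)        # loaded die: 4
print("fair die showed:", fair)     # fair die showed: [5, 4, 1, 6]
```

The forward pass narrows the belief to {3, 4, 5, 6}, then {3, 4}, then {4}; the backward pass then recovers the fair die's number on every roll, exactly as in the walkthrough above.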