You are at a party, and Alex is telling a boring story. You are much more interested in the gossip that Sam is recounting to Pat, so you tune out Alex and focus on Sam’s words. Congratulations: you have just demonstrated the human ability to solve the “cocktail party problem”—to pick out one thread of speech from the babble of two or more people. Computers so far lack that power.
Although automated speech recognition is increasingly routine, it fails when faced with two people talking at once. Computerized speech separation would not only improve speech-recognition systems, it could also advance many other endeavors that require separating mixed signals, such as making sense of brain-scan images.
The problem is devilishly hard. In the past several years, however, computer scientists have made exciting progress. One group has even achieved a very rare feat in automated perception: outperforming humans.
Why So Difficult?
Separating two streams of words is far more challenging than understanding the speech of one talker because the number of possible sound combinations is astronomical. Applying the usual techniques of ordinary (single-talker) speech recognition in brute-force fashion, to explore all the alternative ways that multiple talkers might have produced the combined sound, would be far too time-consuming. To solve the cocktail party problem efficiently, then, an algorithm must exploit special characteristics of speech sounds.
Whether one person is talking or many, the sound contains a spectrum of frequencies, and the intensity of each frequency changes on a millisecond timescale; spectrograms display data of this kind. Standard single-talker speech recognition analyzes the data at the level of phonemes, the individual units of sound that make up words—F-OH-N-EE-M-Z. Each spoken phoneme produces a variable but recognizable pattern in the spectrogram.
Statistical models play a major role in all speech recognition, specifying the expected probability that, for instance, an “oh” sound will be followed by an “n.” The recognition engine looks for the most likely sequences of phonemes and tries to build up whole words and plausible sentences.
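The search described above can be sketched in miniature. The toy code below runs Viterbi decoding over a three-phoneme vocabulary: per-frame acoustic scores say how well each phoneme matches the sound at each instant, and bigram transition probabilities encode expectations such as "oh" being likely to precede "n." All the probabilities here are invented for illustration; real recognizers use far richer models trained on large corpora.

```python
# Toy sketch of phoneme-sequence search: Viterbi decoding over a
# hypothetical bigram model. All probabilities are illustrative.
import math

PHONEMES = ["f", "oh", "n"]

# Hypothetical bigram transition probabilities P(next phoneme | current).
TRANSITIONS = {
    "f":  {"f": 0.1, "oh": 0.8, "n": 0.1},
    "oh": {"f": 0.1, "oh": 0.2, "n": 0.7},
    "n":  {"f": 0.3, "oh": 0.4, "n": 0.3},
}

def viterbi(frame_scores):
    """frame_scores: list of dicts mapping phoneme -> P(frame | phoneme).
    Returns the most probable phoneme sequence under the toy model."""
    # Best log-probability of a path ending in each phoneme at frame 0,
    # assuming a uniform prior over starting phonemes.
    best = {p: math.log(frame_scores[0][p] / len(PHONEMES)) for p in PHONEMES}
    paths = {p: [p] for p in PHONEMES}
    for scores in frame_scores[1:]:
        new_best, new_paths = {}, {}
        for p in PHONEMES:
            prev = max(PHONEMES,
                       key=lambda q: best[q] + math.log(TRANSITIONS[q][p]))
            new_best[p] = (best[prev] + math.log(TRANSITIONS[prev][p])
                           + math.log(scores[p]))
            new_paths[p] = paths[prev] + [p]
        best, paths = new_best, new_paths
    winner = max(PHONEMES, key=lambda p: best[p])
    return paths[winner]

# Acoustic scores for three frames resembling "f," then "oh," then "n."
frames = [
    {"f": 0.7, "oh": 0.2, "n": 0.1},
    {"f": 0.1, "oh": 0.8, "n": 0.1},
    {"f": 0.1, "oh": 0.2, "n": 0.7},
]
print(viterbi(frames))  # the most likely phoneme sequence
```

The key point is that the transition model prunes the search: sequences with implausible phoneme pairs score poorly and drop out without being explored exhaustively.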
When two people talk at once, the number of possibilities explodes. The frequency spectrum at each moment could come from any two phonemes, enunciated in any of the ways each person might use them in a word. Each additional talker makes the problem exponentially worse.
Fortunately, though, sounds of speech tend to be “sparse”: a spectrogram of two speakers usually has many small regions in which one speaker is much louder than the other. For those regions, ordinary speech recognition can find prospective phonemes matching the dominant speaker, greatly simplifying the search. It is by exploiting such features as sparseness that computer scientists have made great strides recently in finding shortcuts through the combinatorial jungle of speech separation. They follow two main approaches.
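A minimal sketch of the sparseness idea: compare two talkers' energy in each time-frequency cell and label the cells one talker clearly dominates. The tiny spectrograms and the roughly 6-decibel dominance threshold below are assumptions for illustration, not figures from any real system.

```python
# Toy sketch of spectrogram "sparseness": label each time-frequency cell
# by whichever talker dominates it. Values and threshold are illustrative.

DOMINANCE_RATIO = 4.0  # ~6 dB in power: louder talker "owns" the cell

def dominance_mask(spec_a, spec_b):
    """Label each cell 'A', 'B', or '?' (neither talker dominates)."""
    mask = []
    for row_a, row_b in zip(spec_a, spec_b):
        row = []
        for a, b in zip(row_a, row_b):
            if a >= DOMINANCE_RATIO * b:
                row.append("A")
            elif b >= DOMINANCE_RATIO * a:
                row.append("B")
            else:
                row.append("?")
        mask.append(row)
    return mask

# Hypothetical power spectrograms (rows = frequencies, columns = instants).
spec_a = [[9.0, 0.5], [8.0, 1.0]]
spec_b = [[1.0, 4.0], [1.5, 1.2]]
print(dominance_mask(spec_a, spec_b))  # [['A', 'B'], ['A', '?']]
```

In the "owned" cells, single-talker recognition techniques apply almost unchanged; only the ambiguous "?" cells need the expensive combinatorial treatment.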
One method works from the bottom up, examining basic features in a spectrogram to discern which regions come from the same talker. For example, a sudden onset of sound at two different frequencies at the same instant probably comes from one talker.
This approach often also looks for spectrogram regions where neither talker dominates. The algorithms then set aside those corrupted regions and try to find phoneme sequences matching the clean regions. A group at the University of Sheffield in England has achieved good results using these methods. In a report published in 2010 comparing how well 10 different algorithms performed on a collection of benchmark overlapping speech samples, the Sheffield group had the third-best overall accuracy.
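The "set aside the corrupted regions" step can be sketched as a missing-data match: score each candidate phoneme template against the observed spectrum using only the cells judged reliable. The templates, observations, and reliability mask below are invented for illustration and are not drawn from the Sheffield system.

```python
# Toy sketch of missing-data matching: score phoneme templates against an
# observed spectrum while ignoring cells corrupted by the other talker.
# All spectra and the reliability mask are illustrative assumptions.

def masked_score(observed, template, reliable):
    """Negative mean squared error over reliable cells only (higher is better)."""
    cells = [(o - t) ** 2 for o, t, r in zip(observed, template, reliable) if r]
    return -sum(cells) / max(len(cells), 1)

observed = [5.0, 2.0, 9.0, 1.0]        # mixed spectrum at one instant
reliable = [True, False, True, True]   # cell 1 judged corrupted by the mix
templates = {
    "oh": [5.0, 7.0, 9.0, 1.0],
    "n":  [1.0, 2.0, 3.0, 4.0],
}

best = max(templates,
           key=lambda p: masked_score(observed, templates[p], reliable))
print(best)  # "oh" matches the clean cells despite the corrupted one
```

Because the corrupted cell is excluded, the large mismatch there cannot drag down the score of the phoneme that actually fits the clean regions.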
Most research groups, however, take a top-down, or “model-based,” approach. Their algorithms look for sequences of phonemes that are plausible individually and that combine to produce the total sound. Because considering every possible combination of overlapping phonemes is far too inefficient, the trick is to simplify or approximate the process without sacrificing too much accuracy.
Tuomas Virtanen of the Tampere University of Technology in Finland simplified the search by focusing alternately on each of the two talkers. In essence: given the current best estimate of talker A’s speech, search for talker B’s speech that best explains the total sound. Then reverse the roles and repeat, alternating until the estimates settle. The Tampere algorithm edged out the Sheffield group’s for the second-highest accuracy, although it remained more than 10 percentage points behind human listeners.
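The alternation just described can be sketched as coordinate descent over two tiny candidate sets. The additive mixing model and the two-entry "codebooks" of candidate spectra below are illustrative assumptions, not the Tampere system itself.

```python
# Toy sketch of alternating estimation: fix talker A's estimate, choose
# the candidate for talker B that best explains the mixture, then swap
# roles and repeat. Codebooks and the additive model are illustrative.

CODEBOOK_A = [[3.0, 0.0], [0.0, 3.0]]   # candidate spectra for talker A
CODEBOOK_B = [[1.0, 1.0], [2.0, 0.5]]   # candidate spectra for talker B

def error(mix, a, b):
    """Squared error between the mixture and the sum of two candidates."""
    return sum((m - (x + y)) ** 2 for m, x, y in zip(mix, a, b))

def separate(mix, iterations=5):
    est_a, est_b = CODEBOOK_A[0], CODEBOOK_B[0]
    for _ in range(iterations):
        # Given A's current estimate, pick the B that best explains the mix...
        est_b = min(CODEBOOK_B, key=lambda b: error(mix, est_a, b))
        # ...then reverse the roles.
        est_a = min(CODEBOOK_A, key=lambda a: error(mix, a, est_b))
    return est_a, est_b

mix = [2.0, 3.5]  # talker A contributed [0, 3], talker B [2, 0.5]
print(separate(mix))
```

Each half-step is just a single-talker search, so the exponential joint search is replaced by a short sequence of cheap ones; the price is that the alternation can, in principle, settle on a merely local best fit.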
The first-ever demonstration of “super-human” automated speech separation was achieved by a group at the IBM Thomas J. Watson Research Center in Yorktown Heights, N.Y. This team’s latest algorithm works efficiently even when more than two people are talking—it has separated speech streams of four overlapping talkers. In part, the algorithm carries out the usual top-down analysis, evaluating trial sequences of phonemes for all the speakers. Between iterations of this search, the program uses its most promising estimates of the speech to look for spectrogram regions where one talker was loud enough to mask the others. Interestingly, attending to such masking makes it practical to refine the estimates of all the talkers’ speech simultaneously.
Automated speech separation still has a long way to go before computers will be able to routinely eavesdrop on gossip at noisy parties. Yet the recent results suggest that prospect may finally be coming into view, if not yet within earshot.