13. The masking function to the rescue
The way that the spectra of two speech signals combine to form the spectrum of the total sound is complicated. Because of the structure of speech, however, one of the two sounds typically has much more power than the other at any given frequency. The total sound at each frequency is therefore roughly equal to the louder sound, because the quieter sound usually contributes too little to matter.
The figure shows this effect in the spectrum of sound from two overlapping speakers for a single 40-millisecond time frame (one time slice from a spectrogram). Talker A, a female, is louder over most of the higher frequencies whereas talker B, a male, is louder over most of the lower frequencies. In most frequency regions, the spectrum of the dominant speaker is very close to the spectrum of the mixture.
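This louder-talker approximation can be checked numerically. The sketch below uses synthetic log-normal draws as stand-ins for real speech spectra (the actual figure uses recorded speech); it shows that the mixture's log spectrum never exceeds the louder component's by more than 10·log10(2) ≈ 3 dB.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-frequency power spectra for two talkers (not real speech;
# log-normal draws mimic the wide dynamic range of speech spectra).
n_bins = 257
power_a = rng.lognormal(mean=0.0, sigma=2.0, size=n_bins)
power_b = rng.lognormal(mean=0.0, sigma=2.0, size=n_bins)

# Powers of uncorrelated signals add; in the log (dB) domain the mixture
# is close to the louder talker at each frequency.
mix_db = 10 * np.log10(power_a + power_b)
max_db = 10 * np.log10(np.maximum(power_a, power_b))

err = mix_db - max_db   # always >= 0, never exceeds 10*log10(2) ~ 3.01 dB
print(err.max())
```

The 3 dB worst case occurs only when both talkers are equally loud in a band; in typical speech mixtures the gap is large, so the error is usually far smaller.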
We can exploit this feature by using a so-called masking function. This function indicates which speaker dominates at each frequency band in a frame. The masking function for the whole spectrogram divides the data into regions in which talker A dominates and regions in which talker B dominates. If we know the masking function, we can evaluate both talkers’ speech independently: each talker’s speech must match the total sound where that talker dominates and it must be quieter than the total sound elsewhere. Finding the best estimate of A’s speech no longer depends on our estimate of B’s speech except for the information carried by the masking function.
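In code, a masking function is just a binary array over the spectrogram's frequency-time cells. A minimal sketch, again with synthetic spectrograms standing in for real speech:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical power spectrograms (frequency bins x time frames) for two talkers.
spec_a = rng.lognormal(sigma=2.0, size=(64, 100))
spec_b = rng.lognormal(sigma=2.0, size=(64, 100))
mixture = spec_a + spec_b

# Masking function: True where talker A dominates, False where talker B does.
mask = spec_a >= spec_b

# Estimate talker A's speech: match the mixture where A dominates;
# elsewhere A must merely be quieter than the mixture (zero is the
# simplest such estimate).
est_a = np.where(mask, mixture, 0.0)
```

Given the mask, the estimate of talker A no longer refers to the estimate of talker B at all, which is what makes the two talkers' speech separable.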
Recently we discovered how to separate speech efficiently using these masking functions. We do so by alternating between two steps. In one step we use our current beliefs about the talkers’ speech to estimate beliefs (probabilities) about the masking functions. In the other, we use the masking functions to update our beliefs about the speech of all the speakers.
The ability to carry along many possible masking functions through the iterations greatly reduces the chances of getting stuck in the wrong valley of the landscape.
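The alternation can be sketched as a toy loop. Here a soft mask (a probability per cell, rather than a single hard masking function) carries the uncertainty between steps, and per-frequency mean levels stand in for the detailed speech models used in the real system:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic "true" talker spectrograms and their mixture
# (using the louder-talker approximation).
true_a = rng.lognormal(sigma=1.5, size=(32, 50))
true_b = rng.lognormal(sigma=1.5, size=(32, 50))
mix = np.maximum(true_a, true_b)

# Crude priors: each talker's average level per frequency band
# (a stand-in for a real statistical speech model).
mean_a = true_a.mean(axis=1, keepdims=True)
mean_b = true_b.mean(axis=1, keepdims=True)

est_a = np.broadcast_to(mean_a, mix.shape)
est_b = np.broadcast_to(mean_b, mix.shape)
for _ in range(5):
    # Step 1: belief (probability) that talker A dominates each cell.
    p_a = est_a / (est_a + est_b)
    # Step 2: update each talker's speech given the soft mask -- match the
    # mixture where the talker likely dominates, fall back to the prior
    # elsewhere.
    est_a = p_a * mix + (1 - p_a) * mean_a
    est_b = (1 - p_a) * mix + p_a * mean_b
```

Because `p_a` is a probability rather than a hard 0/1 choice, many candidate masking functions are effectively carried through every iteration, which is what keeps the search from committing too early to a wrong answer.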
14. Multiple talkers
The new algorithm based on speaker masking makes it possible to separate more complicated speech mixtures that were previously beyond the reach of computerized analysis. For example, the algorithm has successfully extracted the speech of one talker from the babble of three other talkers:
The mixed speech consists of people saying these sentences:
Lay white at K 5 again.
Bin blue by M zero now.
Set green in M 7 please.
Lay green with S 7 please.
Our algorithm’s reconstruction of the third of these speakers sounds very clear:
The task of discerning that talker is made all the harder because some of the other people are speaking more loudly. Despite that challenge, our algorithm reconstructs the target speech accurately enough to recognize the words correctly.
These spectrograms illustrate the mixed speech, the actual target speech and our reconstruction of it:
The color scale goes from low power (blue) to high power (red).
The masking function that makes this speech separation possible can also be depicted as a kind of spectrogram. In these images, white areas indicate that the target speaker dominates the audio signal in that frequency band at that time, and black areas indicate where sound from the other speakers dominates.
By the end of the iteration process, our algorithm’s estimate of the masking function based on its analysis of the mixed audio is quite close to the actual masking function computed using the true audio signals that made up the mixture.
This approach to speech separation has applications far beyond audio surveillance or the improvement of speech recognition. The basic problem of separating signals is fundamental, appearing in a broad spectrum of applications ranging from brain imaging analysis to the development of intelligent robots. It will be some time before a mechanical bartender takes your cocktail order over the chatter of a festive party, but the era of robust robotic audio interaction has now begun.