The year is 1974, and Harry Caul is monitoring a couple walking through a crowded Union Square in San Francisco. He uses shotgun microphones to secretly record their conversation, but at a critical point, a nearby percussion band drowns out the conversation. Ultimately Harry has to use an improbable gadget to extract the nearly inaudible words, “He’d kill us if he got the chance,” from the recordings.

This piece of audio forensics was science fiction when it appeared in the movie The Conversation more than three decades ago. Is it possible today?

Sorting out the babble from multiple conversations is popularly known as the “cocktail party problem,” and researchers have made many inroads toward solving it in the past 10 years. Human listeners can selectively tune out all but the speaker of interest when multiple speakers are talking. Unlike people, machines have been notoriously unreliable at recognizing speech in the presence of noise, especially when the noise is background speech. Speech recognition technology is becoming increasingly ubiquitous and is now being used for dictating text and commands to computers, phones and GPS devices. But good luck getting anything but gibberish if two people speak at once.

A flurry of recent research has focused on the cocktail party problem. In 2006, Martin Cooke of the University of Sheffield in England and Te-Won Lee of the University of California, San Diego, organized a speech separation “challenge,” a task designed to compare different approaches to separating and recognizing the mixed speech of two talkers. Since then, researchers around the world have built systems to compete against one another and against the ultimate benchmark: human listeners.

Here, we survey the computational challenges of speech separation and outline the techniques used to tackle the problem. In particular, we describe the workings of the “superhuman” algorithm that three of us (along with our colleague Trausti T. Kristjansson of Google) entered in the separation challenge. We then describe a subsequent algorithm that can efficiently solve more complicated separation problems involving more than two speakers, problems that would take eons to unravel with the original approach. (See also "Solving the Cocktail Party Problem" from the April 2011 issue of Scientific American.)

1. Try it yourself

To get an idea of what the speech separation algorithms are up against, see if you can hear the target words in some overlapping speech of the kind used in the challenge. All the sentences spoken in the samples use a very limited vocabulary and have the same simple structure as this example: “Place red at C two now.” (The sentences may seem less strange if you imagine they are instructions about what to do with colored tokens in a board game.)

In each mixture, one of the talkers mentions “white.” Your goal is to discern the letter-number combination (“C two” in the example) spoken in the sentence about “white.”

The limited vocabulary and simplified grammar allow the research to focus on the challenge of separating overlapping speech without requiring the infrastructure needed for recognizing more complicated utterances. The algorithms processed a few thousand such test samples, which varied in several ways. In some samples, the “target” and “masking” talkers were equally loud, but in most they differed slightly or moderately in volume. The target and masker could be of different genders or the same gender, or they could even be the same person speaking both sentences. Human listeners have the greatest trouble when the target and masker are the same person and the target speaks at about the same volume as the masker or slightly lower.
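As a rough illustration of how a test mixture with a chosen loudness difference could be built, the sketch below scales a target signal so that its level sits a given number of decibels above or below the masker before adding the two. The function name `mix_at_level_difference`, the use of root-mean-square level as the loudness measure, and the random noise standing in for recorded sentences are all assumptions made for this example; the challenge's actual procedure is not described here.

```python
import numpy as np

def mix_at_level_difference(target, masker, target_db_above_masker):
    """Scale `target` to sit the requested number of dB above (or, if
    negative, below) `masker` in RMS level, then add the two signals.

    Hypothetical helper for illustration only.
    """
    target_rms = np.sqrt(np.mean(target ** 2))
    masker_rms = np.sqrt(np.mean(masker ** 2))
    # Convert the decibel difference into an amplitude ratio.
    desired_ratio = 10.0 ** (target_db_above_masker / 20.0)
    gain = desired_ratio * masker_rms / target_rms
    return gain * target + masker, gain

# Random noise stands in for two recorded sentences (an assumption).
rng = np.random.default_rng(1)
target = rng.standard_normal(16000)
masker = 0.5 * rng.standard_normal(16000)

# Make the target 3 dB quieter than the masker.
mixture, gain = mix_at_level_difference(target, masker, -3.0)

# Check the level difference that was actually achieved.
achieved_db = 20.0 * np.log10(
    np.sqrt(np.mean((gain * target) ** 2)) / np.sqrt(np.mean(masker ** 2))
)
```

Passing 0.0 instead of -3.0 would give the equally loud case mentioned above.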

2. How spectrograms represent speech

To separate the speech of multiple talkers or to recognize one person’s speech, computers represent the sound signal by its spectrum—the energy in the sound at each frequency. A spectrogram displays how the spectrum varies over time, with the color at each point indicating the sound energy at that frequency and time. The spectrogram conveys all the information needed to recognize speech. In fact, computer scientist Victor Zue of the Massachusetts Institute of Technology used to teach a class on how to transcribe speech by just looking at a spectrogram.

To produce a spectrogram, software divides the sound signal into short, overlapping time segments called frames, each about 40 milliseconds long (1/25th of a second). The overlap avoids losing information at the start and end of each frame. The sound spectrum is determined for each frame. Thus a spectrogram is a series of individual spectra, one for each frame. Speech recognition and speech separation typically work by moving along a spectrogram one frame at a time.
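The framing procedure just described can be sketched in a few lines of Python with NumPy. This is a minimal illustration, not the software used in the research: the 40-millisecond frame length comes from the text, while the 10-millisecond step between frames, the Hann taper, the 16 kHz sample rate and the function name `spectrogram` are assumptions chosen for the example.

```python
import numpy as np

def spectrogram(signal, sample_rate, frame_ms=40.0, hop_ms=10.0):
    """Return per-frame energy spectra, shape (num_frames, num_bins).

    frame_ms follows the article; hop_ms (the step between frame
    starts) is an assumed value that makes the frames overlap.
    """
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    num_frames = 1 + (len(signal) - frame_len) // hop_len
    # A Hann window tapers each frame toward zero at its edges.
    window = np.hanning(frame_len)
    frames = np.stack([
        signal[i * hop_len : i * hop_len + frame_len] * window
        for i in range(num_frames)
    ])
    # Energy at each frequency: squared magnitude of the Fourier transform.
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2

# A one-second, 1,000-hertz test tone stands in for speech.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000 * t)

spec = spectrogram(tone, sample_rate)
# Each frequency bin spans sample_rate / frame_len = 25 hertz,
# so the tone's energy should peak in bin 40 of every frame.
peak_bins = spec.argmax(axis=1)
```

Each row of `spec` is the spectrum of one frame, so stepping through the rows corresponds to moving along the spectrogram one frame at a time.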

3. Spectrogram of overlapping speech

Mixing audio sources together is a little like pouring milk into coffee. Once they blend together, there is no simple way of separating them. In each time frame, the spectra of the individual sources essentially add together. In principle, dividing the sound into two parts is as arbitrary as asking, “If x plus y equals 10, what are x and y?”
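A small numerical sketch can make the “x plus y equals 10” point concrete. In the code below, random noise stands in for one frame from each of two talkers (an assumption; real speech would behave the same way in this respect), and all variable names are illustrative. It checks three things: the complex spectra of the sources add exactly; the energy spectra add only approximately, differing by a cross term that depends on the sources' relative phase; and a quite different pair of “sources” produces exactly the same mixture.

```python
import numpy as np

rng = np.random.default_rng(0)
frame_len = 640  # 40 ms at an assumed 16 kHz sample rate

# Noise stands in for one frame of speech from each talker.
talker_a = rng.standard_normal(frame_len)
talker_b = rng.standard_normal(frame_len)
mixture = talker_a + talker_b

spec_a = np.fft.rfft(talker_a)
spec_b = np.fft.rfft(talker_b)
spec_mix = np.fft.rfft(mixture)

# The Fourier transform is linear, so complex spectra add exactly.
complex_spectra_add = np.allclose(spec_mix, spec_a + spec_b)

# Energies add only up to a phase-dependent cross term:
# |A + B|^2 = |A|^2 + |B|^2 + 2 Re(A * conj(B)).
energy_gap = np.abs(spec_mix) ** 2 - (np.abs(spec_a) ** 2 + np.abs(spec_b) ** 2)
cross_term = 2.0 * np.real(spec_a * np.conj(spec_b))
gap_is_cross_term = np.allclose(energy_gap, cross_term)

# Like x + y = 10, the mixture alone does not pin down its parts:
# shifting any signal from one source to the other leaves it unchanged.
shift = rng.standard_normal(frame_len)
alt_a = talker_a + shift
alt_b = talker_b - shift
same_mixture = np.allclose(alt_a + alt_b, mixture)
```

Without further knowledge about what speech looks like, nothing in the mixture favors the pair (`talker_a`, `talker_b`) over (`alt_a`, `alt_b`).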

At a real cocktail party, you get some extra information by having two ears. The slightly different sound detected by each ear tells you something about the directions the sounds are coming from, which can help you pick out one talker from the crowd. But you get no such assistance if the two people are in the same general direction, and neither does a computer processing a recording made with a single microphone. The speech separation challenge focused on this “monaural” version of the problem.

Fortunately, as is apparent by looking at spectrograms, the sounds of speech have a lot of structure. All approaches to speech separation exploit this structure to some degree.
