Synthetic-speech researchers at IBM have been tackling a much tougher challenge: making computers say anything a live person could say, and in a voice that sounds natural. For example, we've developed systems that can "read" a breaking news story or a bunch of e-mail messages aloud over the phone. Like the current phrase-splicing systems, our newest ones, called Supervoices, are also based on recordings of a human speaker and they can respond in real time. But the difference is that they can utter anything at all--including natural-sounding words the original speaker never said.
What are the immediate uses of this technology? They include delivery of up-to-the-minute news, reading machines for the handicapped, automotive voice controls and retrieving e-mail over the phone--or any system where the vocabulary is large, the content changes frequently or unpredictably, and a visual display isn't practical. In the future Supervoices could enhance video and computer games, handheld devices and even motion-picture production. IBM released the latest generation of the technology for commercial use in late 2002.
Talk to Me
Scientists have attempted to simulate human speech since the late 1700s, when Wolfgang von Kempelen built a "Speaking Machine" that used an elaborate series of bellows, reeds, whistles and resonant chambers to produce rudimentary words. By the 1970s digital computing enabled the first generation of modern text-to-speech systems to reach fairly wide use. Makers of these systems attempted to model the entire speech production process directly, using a relatively small number of parameters. The result was intelligible, though somewhat robotic-sounding, speech. The advent of faster computers and inexpensive data storage in the late 1990s made today's most advanced synthetic speech possible. It is based on the premise that speech is composed of a finite number of linguistic building blocks called phonemes and that these can be arranged in new sequences to create any word. Therefore, a set of recordings of a speaker uttering all these building blocks can serve as a kind of typesetter's case for assembling speech.
Supervoices use this building-block model. While most of us think of language in terms of letters or words, the software treats it as a series of phonemes. English contains about 40 unique phonemes. For example, the word "please" is composed of four: P, L, EE and Z. Supervoice contains a collection of recorded samples of each phoneme. When it comes time to speak, the software grabs the appropriate samples needed to piece together new words.
Speech synthesis starts with a human voice, so our team typically auditions dozens of speakers to find the right one for a given task. We usually look for someone who has an agreeable voice and good, clear pronunciation that is free of any significant regional accent; at times, however, we may need other characteristics for a specialized application, such as synthesizing English with a foreign inflection or creating a robot voice for a movie. The speaker who lands the part sits in a sound booth and reads several thousand sentences, which take more than a week to record. The sentences are chosen for their diverse phonetic content, to ensure that we capture lots of examples of all the English phonemes in many different contexts. The result is a collection of several thousand voice files.
Software then converts the written text from a series of words into one of phonemes. The software notes features of interest about each phoneme, such as what phonemes preceded and followed it, or whether it is the first or last one in a sentence. It also identifies parts of speech such as nouns or verbs in the text. So for example, if the speaker read "Welcome to my home page," the program would translate it into something like
W EH L K UH M T OO M I H OW M P AY J,
noting among other things that "page" is a noun, that the W was followed by an EH, and that the J was the final sound in the phrase.
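As a rough illustration of this step, the sketch below converts the sample sentence into phonemes and records each phoneme's neighbors and position. The tiny lexicon, the function name annotate and the record fields are assumptions made for the example; a real system would rely on a full pronunciation dictionary and would also tag parts of speech.

```python
# Hypothetical mini-lexicon using the article's informal phoneme notation.
# A real system would use a complete pronunciation dictionary.
LEXICON = {
    "welcome": ["W", "EH", "L", "K", "UH", "M"],
    "to": ["T", "OO"],
    "my": ["M", "I"],
    "home": ["H", "OW", "M"],
    "page": ["P", "AY", "J"],
}

def annotate(sentence):
    """Return one record per phoneme with its neighbors and position flags."""
    words = sentence.lower().strip(".,?!").split()
    phonemes = [p for word in words for p in LEXICON[word]]
    records = []
    for i, phoneme in enumerate(phonemes):
        records.append({
            "phoneme": phoneme,
            "prev": phonemes[i - 1] if i > 0 else None,
            "next": phonemes[i + 1] if i + 1 < len(phonemes) else None,
            "is_first": i == 0,
            "is_last": i == len(phonemes) - 1,
        })
    return records

records = annotate("Welcome to my home page")
print(" ".join(r["phoneme"] for r in records))
# W EH L K UH M T OO M I H OW M P AY J
print(records[0]["next"], records[-1]["is_last"])
# EH True
```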
Once the text has been processed, it's time to examine our sound files. We measure them for three features: pitch, timing and loudness, collectively known as prosody. These parameters will help us to decide later on which example of each sound we want to use to synthesize a given phrase. Pitch, timing, and loudness are moving targets that change from moment to moment. You can think of the measurements as a string of annotations along the sound file.
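To make the idea of a "string of annotations" concrete, here is a minimal sketch that steps through a signal frame by frame and records a pitch and loudness estimate for each frame. The article does not say how these quantities are measured; the autocorrelation pitch estimate, the root-mean-square loudness proxy, the sample rate and the frame size are all assumptions chosen for simplicity, and a pure tone stands in for a recording.

```python
import math

SAMPLE_RATE = 8000  # Hz; an assumed rate for this toy signal
FRAME = 320         # samples per frame (40 ms at 8 kHz)

def rms(frame):
    """Loudness proxy: root-mean-square amplitude of the frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def pitch_hz(frame, min_lag=20, max_lag=200):
    """Estimate pitch as the lag with the strongest autocorrelation."""
    def autocorr(lag):
        return sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
    best_lag = max(range(min_lag, max_lag + 1), key=autocorr)
    return SAMPLE_RATE / best_lag

# Stand-in for a recording: one second of a 200 Hz tone.
signal = [0.5 * math.sin(2 * math.pi * 200 * n / SAMPLE_RATE)
          for n in range(SAMPLE_RATE)]

# One annotation per frame, keyed by the frame's start time.
annotations = []
for start in range(0, len(signal) - FRAME + 1, FRAME):
    frame = signal[start:start + FRAME]
    annotations.append({
        "time_s": start / SAMPLE_RATE,
        "pitch_hz": pitch_hz(frame),
        "loudness_rms": rms(frame),
    })

print(len(annotations), annotations[0]["pitch_hz"])
```

Real speech is far less tidy than a pure tone, so a practical pitch tracker needs voicing detection and error correction that this sketch omits.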
Next, using techniques borrowed from speech recognition (dictation programs that translate speech into text), the software correlates each recorded phoneme with its text counterpart. With the audio and the text aligned, we can look at a recorded voice file and pinpoint where each phoneme begins and ends. This is crucial; once we can locate and label the phonemes, our software can also precisely edit and catalogue them, and put them into a searchable database.
Our database contains an average of 10,000 recorded samples of each English phoneme. At first glance, that might appear to be a lot of redundancy. But the samples vary quite a lot because they were spoken with different pitches and came from different phonetic contexts. For example, let's look at one phoneme, the sound OO, as in "smooth." Some of the OO's in the database were originally followed by an L, as in "pool," while others were originally at the end of a word, as in "shampoo." These distinctions change the sound of the OO and therefore govern where we can use it later.
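One simple way to picture such a database is as an index keyed by a phoneme and its context. The sketch below is only an illustration: the records, the pitch values and the choice to key on the following phoneme alone are invented for the example, and the actual database tracks far more context.

```python
from collections import defaultdict

# Hypothetical records: (phoneme, following phoneme or None at word end,
# source word, mean pitch in Hz). All values are invented for illustration.
SAMPLES = [
    ("OO", "L", "pool", 118.0),
    ("OO", None, "shampoo", 104.0),
    ("OO", "TH", "smooth", 121.0),
    ("OO", "L", "cool", 131.0),
]

# Group samples by (phoneme, following phoneme) so lookups are direct.
index = defaultdict(list)
for phoneme, following, word, pitch in SAMPLES:
    index[(phoneme, following)].append({"word": word, "pitch": pitch})

def candidates(phoneme, following):
    """Return the recorded samples of a phoneme that share the given context."""
    return index.get((phoneme, following), [])

print([c["word"] for c in candidates("OO", "L")])
# ['pool', 'cool']
print([c["word"] for c in candidates("OO", None)])
# ['shampoo']
```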
It's one thing to have a recorded library of all the English phonemes, but when it comes to synthesizing an expressive, natural-sounding sentence, we need to determine what characteristics each chunk of speech should have. For example, a speaker typically slows down before a pause, such as when a comma appears in the text. So we need to take note of the longer durations of the sounds preceding commas. The pitch is also likely to be lower in the speech before commas. We use our speaker's database to build a statistical model that infers generalities about the rise and fall of pitch, duration and loudness for that person's speech. The statistical model learns these generalities automatically and will apply them later to make its synthetic speech sound more natural.
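The flavor of such a model can be suggested with a drastically simplified stand-in: average the observed duration of each phoneme separately for positions before a pause and elsewhere. The observations below are invented, and the function name fit is an assumption; the real statistical model is far richer and also covers pitch and loudness.

```python
from collections import defaultdict
from statistics import mean

# Invented observations: (phoneme, precedes a pause?, duration in ms).
OBSERVATIONS = [
    ("EE", False, 90), ("EE", False, 110),
    ("EE", True, 150), ("EE", True, 170),
    ("UH", False, 60), ("UH", True, 95),
]

def fit(observations):
    """Learn the mean duration for each (phoneme, precedes_pause) pair."""
    groups = defaultdict(list)
    for phoneme, before_pause, ms in observations:
        groups[(phoneme, before_pause)].append(ms)
    return {key: mean(values) for key, values in groups.items()}

model = fit(OBSERVATIONS)
print(model[("EE", False)], model[("EE", True)])
# 100 160
```

Even this toy version picks up the pattern described above: the same phoneme is predicted to last longer when it comes just before a pause.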
Supervoices in Action
Now that we've "built" a system, let's put Supervoice to work. All of the processing Supervoice does occurs in milliseconds--fast enough that people can converse with the computer in real time. First we'll give it something to say, like "Can we have lunch today?" We have to convert the words into phonemes, the building blocks of Supervoices, which makes our sentence look like this:
K AE N W EE H AE V L UH N CH T OO D AY
Supervoice notes features of interest about the phrase, among them that it is a question, that the third word is a verb, and that the second syllable of the last word is stressed.
We enter the features we noted into the statistical model. Based on those features, it sketches the pitch, timing and loudness contour the sentence should follow. For example, the model should notice that it's a yes/no question and specify a rising pitch at the end of the sentence. Equipped with this contour, we have only to look in our database to find phonemes that match the curve. We hang our phoneme samples on this figurative armature. Which phoneme sample should we choose to synthesize each part of the sentence? Our sentence contains 16 individual phonemes, with a staggering 10^64 (that is, 10,000^16) possible permutations, too many to consider. So we use a technique called dynamic programming to search the database efficiently and find the best fit.
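The following sketch shows how dynamic programming can avoid checking every combination. It is a toy under stated assumptions, not the Supervoices search: each candidate sample is reduced to a single pitch value, the "target cost" is the distance from the desired contour, the "join cost" is the pitch jump between neighboring samples, and the names select_units and join_weight are hypothetical.

```python
def select_units(targets, candidates, join_weight=1.0):
    """Pick one candidate pitch per position, minimizing target + join cost.

    targets:    desired pitch (Hz) at each phoneme position
    candidates: candidates[i] lists the pitches of the recorded samples
                available for position i
    Returns (total cost, chosen pitches).
    """
    # best holds, for each candidate at the current position, the cheapest
    # path that ends in that candidate: (cost so far, path so far).
    best = [(abs(c - targets[0]), [c]) for c in candidates[0]]
    for target, options in zip(targets[1:], candidates[1:]):
        new_best = []
        for c in options:
            # Cheapest way to arrive at c from any candidate one step back.
            arrive_cost, path = min(
                ((cost + join_weight * abs(c - prev[-1]), prev)
                 for cost, prev in best),
                key=lambda item: item[0],
            )
            new_best.append((arrive_cost + abs(c - target), path + [c]))
        best = new_best
    return min(best, key=lambda item: item[0])

targets = [100, 110, 130]
candidates = [[95, 120], [100, 112, 140], [105, 128]]
cost, path = select_units(targets, candidates)
print(cost, path)
# 42.0 [95, 112, 128]

# The count quoted in the text: 10,000 choices at each of 16 positions.
print(10_000 ** 16 == 10 ** 64)
# True
```

Because each position only needs the cheapest path into each of its candidates, the work in this sketch grows with the number of positions times the square of the candidates per position, rather than with the number of complete combinations.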
Once we've assembled the best-fit phonemes in a row, all that remains is smoothing. Though we've had many samples to select from and we've chosen them carefully, small discontinuities will remain at each splice. When adjacent samples deviate slightly in pitch, the sentence ends up with a jumpy, warbling sound. We fix this by forcing small pitch adjustments to correct it, like a carpenter sanding a series of glued joints to create a smooth, pleasing surface. We literally bend the pitch of each phoneme to match that of its neighbors. The result is polished-sounding conversational speech.
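A simplified picture of this smoothing step, working only on lists of pitch values rather than on audio: at each splice, both neighbors are bent toward the midpoint of the gap, with the correction fading to zero at the far edge of each sample. The function name smooth_joins, the linear ramp and the assumption that every contour has at least two points are choices made for this sketch; altering the pitch of actual recordings requires signal-processing techniques not shown here.

```python
def smooth_joins(contours):
    """Bend pitch contours so neighbors meet at the midpoint of each splice.

    contours: list of pitch contours (Hz), one per sample, each with at
              least two points. Returns new lists; the input is unchanged.
    """
    result = [list(map(float, c)) for c in contours]
    for left, right in zip(result, result[1:]):
        gap = right[0] - left[-1]  # pitch jump at this splice
        # Ramp on the left sample: no change at its start, +gap/2 at its end.
        for i in range(len(left)):
            left[i] += (gap / 2) * (i / (len(left) - 1))
        # Ramp on the right sample: -gap/2 at its start, no change at its end.
        for i in range(len(right)):
            right[i] -= (gap / 2) * (1 - i / (len(right) - 1))
    return result

contours = [[100, 102, 104], [110, 108, 106], [100, 100, 100]]
smoothed = smooth_joins(contours)
for contour in smoothed:
    print(contour)
# [100.0, 103.5, 107.0]
# [107.0, 105.0, 103.0]
# [103.0, 101.5, 100.0]
```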
We often debate among ourselves the holy grail of text-to-speech technology. Should it be indistinguishable from a live human speaker, as in a Turing test? Probably not. For one thing, people wouldn't feel comfortable with the notion that they might be being "tricked" when, for example, they dial in to a company's service center. And anyway, there are situations where a natural human sound isn't the best choice, such as a voice that attempts to command your attention to prevent you from falling asleep while driving, or for cartoons, toys, and video and computer games, which might require characters that don't sound like a person. But the text-to-speech systems can do things that the average person can't, like speak dozens of languages almost as well as a native or recite an entire book without getting tired.
A better ultimate goal for the technology might be this: a pleasing, expressive voice that people feel comfortable listening to for long periods without perceived effort. Or perhaps one that is sophisticated enough to exploit the social and communication skills that we've all grown up with. Consider this example:
Caller: "I'd like a flight to Boston Tuesday morning."
Computer: "I have two flights available on Tuesday afternoon."
The software's ability to emphasize the word "afternoon" would simplify the exchange enormously. The caller implicitly understands that no flights are available in the morning, and that the computer is offering an alternative. In contrast, a completely unexpressive system could cause the caller to assume that the computer had misunderstood him, and he would probably end up repeating the request.
This sort of expressiveness is the biggest remaining challenge for technology like Supervoices, even though it already sounds astonishingly close to live human speech. After all, the software doesn't truly comprehend what it's saying, so it may lack subtle changes in speaking style that you'd expect from an eighth grader, who can interpret what he or she is reading. Given the limitless range of the human voice, we'll have our work cut out for us for a long time.
Andy Aaron, Ellen Eide and John F. Pitrelli work at the IBM T.J. Watson Research Center in Yorktown Heights, N.Y. Aaron received a B.A. degree in physics from Bard College and combines science and media experience in his work. He has worked on post-production sound at Francis Coppola's Zoetrope Studios and Lucasfilm's Skywalker Sound, and also recorded and created sound effects for dozens of major motion pictures. His sound-recording skills brought him to IBM, where he has recorded thousands of voices for the Human Language Technology Group, in an effort to model the many forms of human speech.
Eide, an electrical engineer with a Ph.D. from M.I.T., has worked on speech recognition and speech synthesis since 1995 in IBM's Human Language Technology Group; prior to her current position, she worked in the Language Technologies group at BBN in Cambridge, Mass. Her research interests include statistical modeling, speech recognition and speech synthesis. She has numerous publications and patents in the fields of speech recognition and speech synthesis.
Pitrelli's Ph.D. work at M.I.T. in electrical engineering and computer science included speech recognition and synthesis. His research interests include speech synthesis, prosody, handwriting and speech recognition, statistical language modeling, and confidence modeling for recognition. Prior to his current position, he worked for seven years in the Speech Technology Group at NYNEX Science & Technology in White Plains, N.Y., and for five years was a research staff member in the IBM Pen Technologies Group. He has published 16 papers and holds two patents with three pending.