Dial up a bank or airline these days, and chances are your call will be answered by a prerecorded voice rather than a live human being. By stringing together canned phrases, such systems do an adequate job of completing a banking or ticket-booking transaction. Though the cobbled-together speech sounds stilted, it suffices for limited transactions whose subject matter is known in advance. But because these systems can't stray from their prerecorded phrases, their capabilities are limited.
Synthetic-speech researchers at IBM have been tackling a much tougher challenge: making computers say anything a live person could say, and in a voice that sounds natural. For example, we've developed systems that can "read" a breaking news story or a batch of e-mail messages aloud over the phone. Like the current phrase-splicing systems, our newest ones, called Supervoices, are based on recordings of a human speaker, and they can respond in real time. The difference is that they can utter anything at all--including natural-sounding words the original speaker never said.
What are the immediate uses of this technology? They include delivery of up-to-the-minute news, reading machines for the handicapped, automotive voice controls and e-mail retrieval over the phone--or any application where the vocabulary is large, the content changes frequently or unpredictably, and a visual display isn't practical. In the future, Supervoices could enhance video and computer games, handheld devices and even motion-picture production. IBM released the latest generation of the technology for commercial use in late 2002.
Talk to Me
Scientists have attempted to simulate human speech since the late 1700s, when Wolfgang von Kempelen built a "Speaking Machine" that used an elaborate series of bellows, reeds, whistles and resonant chambers to produce rudimentary words. By the 1970s digital computing enabled the first generation of modern text-to-speech systems to reach fairly wide use. Makers of these systems attempted to model the entire speech production process directly, using a relatively small number of parameters. The result was intelligible, though somewhat robotic-sounding, speech. The advent of faster computers and inexpensive data storage in the late 1990s made today's most advanced synthetic speech possible. It is based on the premise that speech is composed of a finite number of linguistic building blocks called phonemes and that these can be arranged in new sequences to create any word. Therefore, a set of recordings of a speaker uttering all these building blocks can serve as a kind of typesetter's case for assembling speech.
Supervoices use this building-block model. While most of us think of language in terms of letters or words, the software treats it as a series of phonemes. English contains about 40 distinct phonemes. For example, the word "please" is composed of four: P, L, EE and Z. A Supervoice contains a collection of recorded samples of each phoneme. When it comes time to speak, the software grabs the appropriate samples and pieces together new words.
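The building-block idea can be illustrated with a minimal sketch. This is not IBM's actual software; the pronunciation dictionary, the phoneme labels and the placeholder "audio" values are all hypothetical, standing in for a real sample bank of recorded snippets. The point is simply the lookup-and-splice step: map a word to its phoneme sequence, fetch a recorded sample for each phoneme, and concatenate them into one waveform.

```python
# Hypothetical pronunciation dictionary: word -> phoneme sequence.
PRONUNCIATIONS = {
    "please": ["P", "L", "EE", "Z"],
}

# Hypothetical sample bank: phoneme -> recorded audio snippet.
# In a real system each entry would be thousands of waveform samples;
# short placeholder lists are used here for clarity.
SAMPLE_BANK = {
    "P":  [0.1, 0.2],
    "L":  [0.3, 0.4],
    "EE": [0.5, 0.6],
    "Z":  [0.7, 0.8],
}

def synthesize(word):
    """Splice the recorded phoneme samples for `word` into one waveform."""
    waveform = []
    for phoneme in PRONUNCIATIONS[word]:
        waveform.extend(SAMPLE_BANK[phoneme])  # append this phoneme's snippet
    return waveform

print(synthesize("please"))
```

A real synthesizer adds much more on top of this skeleton: it stores many samples of each phoneme taken from different contexts and chooses among them, and it smooths the joins between snippets; but the core operation is this concatenation.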
Speech synthesis starts with a human voice, so our team typically auditions dozens of speakers to find the right one for a given task. We usually look for someone with an agreeable voice and good, clear pronunciation free of any significant regional accent; at times, however, a specialized application calls for other characteristics, such as English with a foreign inflection or a robot voice for a movie. The speaker who lands the part sits in a sound booth and reads several thousand sentences, which take more than a week to record. The sentences are chosen for their diverse phonetic content, to ensure that we capture many examples of every English phoneme in many different contexts. The result is a collection of several thousand voice files.