Once we've assembled the best-fit phonemes in a row, all that remains is smoothing. Though we've had many samples to select from and we've chosen them carefully, small discontinuities will remain at each splice. When adjacent samples deviate slightly in pitch, the sentence ends up with a jumpy, warbling sound. We correct this with small pitch adjustments, like a carpenter sanding a series of glued joints to create a smooth, pleasing surface: we literally bend the pitch of each phoneme to match that of its neighbors. The result is polished-sounding conversational speech.
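To make the idea concrete, here is a minimal sketch in Python of boundary pitch bending. It assumes each selected unit is represented by its F0 (pitch) contour as a NumPy array; the function name, the `blend` parameter, and the linear-ramp correction are illustrative assumptions, not the actual Supervoices code.

```python
import numpy as np

def smooth_splices(f0_contours, blend=0.5):
    """Bend each unit's F0 (pitch) contour so adjacent units meet
    at a common pitch, removing the small jump at each splice.

    f0_contours: list of 1-D NumPy arrays, one pitch contour (in Hz)
    per selected phoneme unit, in sentence order.
    """
    smoothed = [c.astype(float) for c in f0_contours]  # work on copies
    for i in range(len(smoothed) - 1):
        left, right = smoothed[i], smoothed[i + 1]
        # Pitch the two units should share at the joint:
        # a weighted average of the two edge values.
        target = blend * left[-1] + (1.0 - blend) * right[0]
        # Ramp the correction so it is full strength at the joint and
        # fades to nothing at the far end of each unit; this "bends"
        # the pitch rather than shifting the whole unit at once.
        left += np.linspace(0.0, 1.0, len(left)) * (target - left[-1])
        right += np.linspace(1.0, 0.0, len(right)) * (target - right[0])
    return smoothed
```

In a real unit-selection synthesizer the computed correction would be applied to the waveform with a pitch-modification technique such as PSOLA (pitch-synchronous overlap-add) rather than by editing F0 values directly; the sketch shows only the bending logic.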
We often debate among ourselves what the holy grail of text-to-speech technology should be. Should it be indistinguishable from a live human speaker, as in a Turing test? Probably not. For one thing, people wouldn't feel comfortable with the notion that they might be "tricked" when, for example, they dial in to a company's service center. And anyway, there are situations where a natural human sound isn't the best choice, such as a voice designed to command your attention and keep you from falling asleep while driving, or characters in cartoons, toys, and video and computer games that aren't supposed to sound like a person. But text-to-speech systems can do things that the average person can't, like speak dozens of languages almost as well as a native or recite an entire book without getting tired.
A better ultimate goal for the technology might be this: a pleasing, expressive voice that people feel comfortable listening to for long periods without perceived effort. Or perhaps one that is sophisticated enough to exploit the social and communication skills that we've all grown up with. Consider this example:
Caller: "I'd like a flight to Boston Tuesday morning."
Computer: "I have two flights available on Tuesday afternoon."
The software's ability to emphasize the word "afternoon" would simplify the exchange enormously. The caller implicitly understands that no flights are available in the morning, and that the computer is offering an alternative. In contrast, a completely unexpressive system could cause the caller to assume that the computer had misunderstood him, and he would probably end up repeating the request.
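As a concrete illustration of the kind of control involved, here is a minimal sketch using the W3C's Speech Synthesis Markup Language (SSML), whose emphasis element lets an application request exactly this stress on "afternoon." The engine object and its synthesize() call are hypothetical placeholders, and the article does not claim that Supervoices uses SSML.

```python
# SSML markup asking the synthesizer to stress "afternoon",
# signaling that the morning request could not be met.
ssml = """
<speak>
  I have two flights available on Tuesday
  <emphasis level="strong">afternoon</emphasis>.
</speak>
"""
# engine.synthesize(ssml)  # hypothetical TTS engine call
```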
This sort of expressiveness is the biggest remaining challenge for technology like Supervoices, even though it already sounds astonishingly close to live human speech. After all, the software doesn't truly comprehend what it's saying, so it may lack the subtle changes in speaking style that you'd expect even from an eighth grader, who can interpret what he or she is reading. Given the limitless range of the human voice, we'll have our work cut out for us for a long time.
Andy Aaron, Ellen Eide and John F. Pitrelli work at the IBM T.J. Watson Research Center in Yorktown Heights, N.Y. Aaron received a B.A. degree in physics from Bard College and combines science and media experience in his work. He has worked on post-production sound at Francis Coppola's Zoetrope Studios and Lucasfilm's Skywalker Sound, and also recorded and created sound effects for dozens of major motion pictures. His sound-recording skills brought him to IBM, where he has recorded thousands of voices for the Human Language Technology Group, in an effort to model the many forms of human speech.
Eide, an electrical engineer with a Ph.D. from M.I.T., has worked on speech-recognition and speech synthesis since 1995 in IBM's Human Language Technology Group; prior to her current position, she worked in the Language Technologies group at BBN in Cambridge, Mass. Her research interests include statistical modeling, speech recognition and speech synthesis. She has numerous publications and patents in the fields of speech recognition and speech synthesis.
Pitrelli's Ph.D. work at M.I.T. in electrical engineering and computer science included speech recognition and synthesis. His research interests include speech synthesis, prosody, handwriting and speech recognition, statistical language modeling, and confidence modeling for recognition. Prior to his current position, he worked for seven years in the Speech Technology Group at NYNEX Science & Technology in White Plains, N.Y., and for five years was a research staff member in the IBM Pen Technologies Group. He has published 16 papers and holds two patents with three pending.