Call a large company these days, and you will probably start by having a conversation with a computer. Until recently, such automated telephone speech systems could string together only prerecorded phrases. Think of the robotic-sounding "The number you have dialed ... 5 ... 5 ... 5 ... 1 ... 2 ... 1 ... 2...." Unfortunately, this stilted computer speech leaves people cold. And because these systems cannot stray from their canned phrases, their abilities are limited.
Computer-generated speech has improved during the past decade, becoming significantly more intelligible and easier to listen to. But researchers now face a more formidable challenge: making synthesized speech closer to that of real humans--by giving it the ability to modulate tone and expression, for example--so that it can better communicate meaning. This elusive goal requires a deep understanding of the components of speech and of the subtle effects of a person's volume, pitch, timing and emphasis. That is the aim of our research group at IBM, of groups at other U.S. companies such as AT&T, Nuance, Cepstral and ScanSoft, and of investigators at institutions including Carnegie Mellon University, the University of California at Los Angeles, the Massachusetts Institute of Technology and the Oregon Graduate Institute. Like earlier phrase-splicing approaches, the latest generation of speech technology--our version is code-named the IBM Natural Expressive Speech Synthesizer, or the NAXPRES Synthesizer--is based on recordings of human speakers and can respond in real time. The difference is that the new systems can say anything at all--including natural-sounding words the recorded speakers never said.