Making Computers Talk

Say good-bye to stilted electronic chatter: new synthetic-speech systems sound authentically human, and they can respond in real time
















Once we've assembled the best-fit phonemes in a row, all that remains is smoothing. Though we've had many samples to select from and we've chosen them carefully, small discontinuities will remain at each splice. When adjacent samples deviate slightly in pitch, the sentence ends up with a jumpy, warbling sound. We fix this by making small pitch adjustments at the splices, like a carpenter sanding a series of glued joints to create a smooth, pleasing surface. We literally bend the pitch of each phoneme to match that of its neighbors. The result is polished-sounding conversational speech.
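To make the idea of bending pitch at the splices concrete, here is a minimal sketch in Python. It is not the authors' implementation. It assumes each selected unit comes with a pitch contour (a list of fundamental-frequency values in hertz, at least two frames long), and it only computes adjusted target contours; actually resynthesizing audio at the new pitch is a separate step not shown here. The function name `smooth_pitch` and the linear taper are illustrative choices.

```python
from typing import List


def smooth_pitch(units: List[List[float]]) -> List[List[float]]:
    """Return adjusted copies of per-unit pitch contours (Hz) so that
    neighboring units meet at the same pitch at every splice.

    Assumption (not from the article): each contour has >= 2 frames.
    """
    if any(len(contour) < 2 for contour in units):
        raise ValueError("each pitch contour needs at least two frames")

    n = len(units)
    start_shift = [0.0] * n  # correction applied at the first frame of a unit
    end_shift = [0.0] * n    # correction applied at the last frame of a unit

    # At each splice, split the pitch jump evenly between the two neighbors.
    for i in range(n - 1):
        gap = units[i + 1][0] - units[i][-1]
        end_shift[i] = gap / 2.0
        start_shift[i + 1] = -gap / 2.0

    smoothed = []
    for contour, s, e in zip(units, start_shift, end_shift):
        last = len(contour) - 1
        # Blend linearly from the start correction to the end correction,
        # so the middle of each unit is changed as little as possible.
        smoothed.append(
            [f0 + (1.0 - k / last) * s + (k / last) * e
             for k, f0 in enumerate(contour)]
        )
    return smoothed


units = [[100.0, 102.0, 104.0], [110.0, 110.0, 110.0], [105.0, 104.0]]
smoothed = smooth_pitch(units)
```

In this toy input the first splice jumps from 104 Hz to 110 Hz; after smoothing, both sides of that splice sit at the midpoint, while the very first and very last frames of the sentence are left untouched.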

Future Directions

We often debate among ourselves the holy grail of text-to-speech technology. Should it be indistinguishable from a live human speaker, as in a Turing test? Probably not. For one thing, people wouldn't feel comfortable with the notion that they might be "tricked" when, for example, they dial in to a company's service center. And anyway, there are situations where a natural human sound isn't the best choice, such as a voice designed to command your attention to keep you from falling asleep while driving, or voices for cartoons, toys, and video and computer games, which might require characters that don't sound like a person. But text-to-speech systems can also do things that the average person can't, like speak dozens of languages almost as well as a native or recite an entire book without getting tired.

A better ultimate goal for the technology might be this: a pleasing, expressive voice that people feel comfortable listening to for long periods without perceived effort. Or perhaps one that is sophisticated enough to exploit the social and communication skills that we've all grown up with. Consider this example:

Caller: "I'd like a flight to Boston Tuesday morning."
Computer: "I have two flights available on Tuesday afternoon."

The software's ability to emphasize the word "afternoon" would simplify the exchange enormously. The caller implicitly understands that no flights are available in the morning, and that the computer is offering an alternative. In contrast, a completely unexpressive system could cause the caller to assume that the computer had misunderstood him, and he would probably end up repeating the request.
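One way to picture this kind of contrastive emphasis is the Python sketch below. The article does not say how the software decides which word to stress, so everything here is an assumption for illustration: the slot names (`day`, `part_of_day`), the function `emphasize_contrast`, and the choice to express stress with the `<emphasis>` element from the W3C Speech Synthesis Markup Language (SSML). The rule is simply to stress any offered value that differs from what the caller asked for.

```python
from xml.sax.saxutils import escape


def emphasize_contrast(requested: dict, offered: dict, template: str) -> str:
    """Fill `template` with the offered values, wrapping every value that
    contrasts with the caller's request in an SSML <emphasis> element."""
    filled = {}
    for slot, value in offered.items():
        text = escape(value)  # keep the markup well-formed
        if slot in requested and requested[slot] != value:
            text = f"<emphasis>{text}</emphasis>"
        filled[slot] = text
    return "<speak>" + template.format(**filled) + "</speak>"


# Hypothetical slots extracted from the caller's request and the system's offer.
requested = {"day": "Tuesday", "part_of_day": "morning"}
offered = {"day": "Tuesday", "part_of_day": "afternoon"}

print(emphasize_contrast(
    requested, offered,
    "I have two flights available on {day} {part_of_day}.",
))
# prints: <speak>I have two flights available on Tuesday <emphasis>afternoon</emphasis>.</speak>
```

Because "Tuesday" matches the request it is left plain, while "afternoon" is marked for stress, which is the cue that tells the caller the morning was unavailable.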

This sort of expressiveness is the biggest remaining challenge for technology like Supervoices, even though it already sounds astonishingly close to live human speech. After all, the software doesn't truly comprehend what it's saying, so it may lack subtle changes in speaking style that you'd expect from an eighth grader, who can interpret what he or she is reading. Given the limitless range of the human voice, we'll have our work cut out for us for a long time.


Andy Aaron, Ellen Eide and John F. Pitrelli work at the IBM T.J. Watson Research Center in Yorktown Heights, N.Y. Aaron received a B.A. degree in physics from Bard College and combines science and media experience in his work. He has worked on post-production sound at Francis Coppola's Zoetrope Studios and Lucasfilm's Skywalker Sound, and also recorded and created sound effects for dozens of major motion pictures. His sound-recording skills brought him to IBM, where he has recorded thousands of voices for the Human Language Technology Group, in an effort to model the many forms of human speech.

Eide, an electrical engineer with a Ph.D. from M.I.T., has worked on speech recognition and speech synthesis since 1995 in IBM's Human Language Technology Group; prior to her current position, she worked in the Language Technologies group at BBN in Cambridge, Mass. Her research interests include statistical modeling, speech recognition and speech synthesis. She has numerous publications and patents in the fields of speech recognition and speech synthesis.

Pitrelli's Ph.D. work at M.I.T. in electrical engineering and computer science included speech recognition and synthesis. His research interests include speech synthesis, prosody, handwriting and speech recognition, statistical language modeling, and confidence modeling for recognition. Prior to his current position, he worked for seven years in the Speech Technology Group at NYNEX Science & Technology in White Plains, N.Y., and for five years was a research staff member in the IBM Pen Technologies Group. He has published 16 papers and holds two patents with three pending.



3 Comments

  1. kadena56 11:54 PM 11/16/07

    timmy libila libila timmy

  2. metamorphmuses 11:48 PM 4/16/09

    This approach does seem to be getting closer to the "holy grail" of speech synthesis... closer to HAL in 2001 or the Enterprise computer in Star Trek: The Next Generation, and further away from the voice we all now associate with Stephen Hawking. (Yes, HAL is creepy for its manipulative, homicidal, misguided reasoning, but we can all agree it's what we'd like our computer to sound like when it talks to us.) However, I'd like to suggest SciAm fix the links in the article -- I'd really like to try the system out for myself.

  3. gigabetz 07:45 PM 4/23/09

    I find this article very amusing. I played with this on a computer as a kid, and I tried to create British English and ended up adding extra vowels to create British-sounding words. Now who has the accent?
