Making Computers Talk

Say good-bye to stilted electronic chatter: new synthetic-speech systems sound authentically human, and they can respond in real time

Software then converts the written text from a series of words into one of phonemes. The software notes features of interest about each phoneme, such as what phonemes preceded and followed it, or whether it is the first or last one in a sentence. It also identifies parts of speech such as nouns or verbs in the text. So for example, if the speaker read "Welcome to my home page," the program would translate it into something like

W · EH · L · K · UH · M · T · OO · M · I · H · OW · M · P · AY · J,

noting among other things that "page" is a noun, that the W was followed by an EH, and that the J was the final sound in the phrase.
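
The bookkeeping described above can be sketched in a few lines of Python. This is only an illustration, not the actual software: the `PhonemeFeatures` record and the `annotate` function are hypothetical names, and part-of-speech tagging is left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PhonemeFeatures:
    """Context noted for one phoneme (hypothetical record layout)."""
    phoneme: str
    prev: Optional[str]   # phoneme that preceded it, if any
    next: Optional[str]   # phoneme that followed it, if any
    is_first: bool        # first sound in the phrase
    is_last: bool         # final sound in the phrase


def annotate(phonemes: List[str]) -> List[PhonemeFeatures]:
    """Attach neighbor and position features to each phoneme."""
    last = len(phonemes) - 1
    return [
        PhonemeFeatures(
            phoneme=p,
            prev=phonemes[i - 1] if i > 0 else None,
            next=phonemes[i + 1] if i < last else None,
            is_first=(i == 0),
            is_last=(i == last),
        )
        for i, p in enumerate(phonemes)
    ]


# "Welcome to my home page" in the article's phoneme notation.
WELCOME = "W EH L K UH M T OO M I H OW M P AY J".split()
features = annotate(WELCOME)
```

Here `features[0]` records that the W was followed by an EH, and `features[-1]` records that the J was the final sound in the phrase.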

Once the text has been processed, it's time to examine our sound files. We measure them for three features: pitch, timing and loudness, collectively known as prosody. These parameters will help us to decide later on which example of each sound we want to use to synthesize a given phrase. Pitch, timing, and loudness are moving targets that change from moment to moment. You can think of the measurements as a string of annotations along the sound file.
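
To make the "string of annotations" concrete, here is a rough sketch that measures loudness and pitch frame by frame. The methods are assumptions chosen for brevity (root-mean-square energy for loudness, a plain autocorrelation peak for pitch); the article does not say which measurements the real system uses, and all function names are hypothetical. Timing would come from the phoneme boundaries rather than from this frame analysis.

```python
import math
from typing import List, Sequence, Tuple


def rms(frame: Sequence[float]) -> float:
    """Loudness proxy: root-mean-square amplitude of one frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def pitch_hz(frame: Sequence[float], sample_rate: int,
             lo_hz: float = 80.0, hi_hz: float = 400.0) -> float:
    """Pitch proxy: the lag with the strongest autocorrelation."""
    min_lag = int(sample_rate / hi_hz)
    max_lag = int(sample_rate / lo_hz)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(frame[n] * frame[n + lag]
                    for n in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


def annotate_prosody(samples: Sequence[float], sample_rate: int,
                     frame_size: int) -> List[Tuple[float, float, float]]:
    """Return (time in seconds, pitch in Hz, loudness) for each frame."""
    notes = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        notes.append((start / sample_rate,
                      pitch_hz(frame, sample_rate),
                      rms(frame)))
    return notes


# Synthetic test tone standing in for a recording: 200 Hz, amplitude 0.5.
RATE = 8000
tone = [0.5 * math.sin(2 * math.pi * 200 * n / RATE) for n in range(640)]
notes = annotate_prosody(tone, RATE, frame_size=320)
```

Each entry in `notes` is one annotation along the sound file. Real speech would need a sturdier pitch tracker than this, since voices are far less regular than a pure tone.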

Next, using techniques borrowed from speech recognition (dictation programs that translate speech into text), the software correlates each recorded phoneme with its text counterpart. With the audio and the text aligned, we can look at a recorded voice file and pinpoint where each phoneme begins and ends. This is crucial; once we can locate and label the phonemes, our software can also precisely edit and catalogue them, and put them into a searchable database.
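
Once the alignment exists, cutting out the phonemes is simple slicing. The sketch below assumes the aligner has already produced a start and end time for each phoneme; the `Alignment` record, the `cut_segments` function and the sample values are hypothetical, and the alignment step itself is not shown.

```python
from typing import List, NamedTuple, Sequence, Tuple


class Alignment(NamedTuple):
    """One aligned phoneme: label plus boundaries in seconds."""
    phoneme: str
    start_s: float
    end_s: float


def cut_segments(samples: Sequence[float], sample_rate: int,
                 alignment: Sequence[Alignment]
                 ) -> List[Tuple[str, List[float]]]:
    """Slice the recording at each phoneme's boundaries."""
    clips = []
    for a in alignment:
        lo = round(a.start_s * sample_rate)
        hi = round(a.end_s * sample_rate)
        clips.append((a.phoneme, list(samples[lo:hi])))
    return clips


# Stand-in data: half a second of "audio" at 100 samples per second,
# with made-up boundaries for the word "pool".
audio = list(range(50))
pool = [Alignment("P", 0.0, 0.1),
        Alignment("OO", 0.1, 0.35),
        Alignment("L", 0.35, 0.5)]
clips = cut_segments(audio, 100, pool)
```

Each labeled clip can then be stored along with its measurements and context.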

Our database contains an average of 10,000 recorded samples of each English phoneme. At first glance, that might appear to be a lot of redundancy. But the samples vary quite a lot because they were spoken with different pitches and came from different phonetic contexts. For example, let's look at one phoneme, the sound OO, as in "smooth." Some of the OO's in the database were originally followed by an L, as in "pool," while others were originally at the end of a word, as in "shampoo." These distinctions change the sound of the OO and therefore govern where we can use it later.
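
A toy version of such a context-aware lookup might look like the following. The `Unit` record, the `UnitDatabase` class and the pitch values are invented for illustration; the real database layout is not described in the article.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Unit:
    """One recorded sample of a phoneme (hypothetical layout)."""
    phoneme: str
    next_phoneme: Optional[str]  # None means the sample ended a word
    pitch_hz: float              # made-up value for illustration
    source_word: str


class UnitDatabase:
    """Samples grouped by phoneme and searchable by what followed them."""

    def __init__(self) -> None:
        self._by_phoneme: Dict[str, List[Unit]] = defaultdict(list)

    def add(self, unit: Unit) -> None:
        self._by_phoneme[unit.phoneme].append(unit)

    def samples(self, phoneme: str) -> List[Unit]:
        """Every sample of a phoneme, whatever its context."""
        return list(self._by_phoneme.get(phoneme, []))

    def find(self, phoneme: str, next_phoneme: Optional[str]) -> List[Unit]:
        """Only the samples that were followed by `next_phoneme`."""
        return [u for u in self.samples(phoneme)
                if u.next_phoneme == next_phoneme]


db = UnitDatabase()
db.add(Unit("OO", "L", 110.0, "pool"))
db.add(Unit("OO", None, 95.0, "shampoo"))
db.add(Unit("OO", "TH", 120.0, "smooth"))

before_l = db.find("OO", "L")      # the OO from "pool"
word_final = db.find("OO", None)   # the OO from "shampoo"
```

With thousands of samples per phoneme, a query like this narrows the field to the ones recorded in a compatible context.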

It's one thing to have a recorded library of all the English phonemes, but when it comes to synthesizing an expressive, natural-sounding sentence, we need to determine what characteristics each chunk of speech should have. For example, a speaker typically slows down before a pause, such as when a comma appears in the text. So we need to take note of the longer durations of the sounds preceding commas. The pitch is also likely to be lower in the speech before commas. We use our speaker's database to build a statistical model that infers generalities about the rise and fall of pitch, duration and loudness for that person's speech. The statistical model learns these generalities automatically and will apply them later to make its synthetic speech sound more natural.
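
As a deliberately simplified stand-in for that statistical model, the sketch below learns just one generality: the average duration of sounds that precede a pause versus those that do not. Averaging by group is an assumption made for brevity, not the technique the real system uses, and the function name and durations are invented.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def fit_duration_model(
        observations: Iterable[Tuple[bool, float]]) -> Dict[bool, float]:
    """Average duration (ms) for sounds before a pause and elsewhere.

    Each observation is (precedes_pause, duration_ms).
    """
    groups: Dict[bool, List[float]] = defaultdict(list)
    for precedes_pause, duration_ms in observations:
        groups[precedes_pause].append(duration_ms)
    return {key: mean(values) for key, values in groups.items()}


# Invented measurements: sounds before a comma run longer.
observed = [(False, 80.0), (False, 100.0), (True, 130.0), (True, 150.0)]
model = fit_duration_model(observed)
```

A fuller model would condition on many features at once (position in the sentence, part of speech, stress) and would predict pitch and loudness as well as duration.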

Supervoices in Action

Now that we've "built" a system, let's put Supervoice to work. All of the processing Supervoice does occurs in milliseconds--fast enough that people can converse with the computer in real time. First we'll give it something to say, like "Can we have lunch today?" We have to convert the words into phonemes, the building blocks of Supervoices, which makes our sentence look like this:

K · AE · N · W · EE · H · AE · V · L · UH · N · CH · T · OO · D · AY

Supervoice notes features of interest about the phrase, among them that it is a question, that the third word is a verb, and that the second syllable of the last word is stressed.

We enter the features we noted into the statistical model. Based on those features, it sketches the pitch, timing and loudness contour the sentence should follow. For example, the model should notice that it's a yes/no question and specify a rising pitch at the end of the sentence. Equipped with this contour, we have only to look in our database to find phonemes that match the curve. We hang our phoneme samples on this figurative armature. Which phoneme sample should we choose to synthesize each part of the sentence? Our sentence contains 16 individual phonemes, and with about 10,000 samples to choose from for each one, there are a staggering 10^64 (that is, 10,000^16) possible combinations, too many to consider one by one. So we use a technique called dynamic programming to search the database efficiently and find the best fit.
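
The article does not spell out the search, so the following is only a sketch of one common form of dynamic programming for this kind of problem (a Viterbi-style search). It assumes two hypothetical cost functions: a target cost for how far a sample strays from the desired contour, and a join cost for how badly two neighboring samples fit together. The toy candidates are (pitch, recording) pairs with invented values.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

U = TypeVar("U")


def select_units(candidates: Sequence[Sequence[U]],
                 target_cost: Callable[[int, U], float],
                 join_cost: Callable[[U, U], float]) -> Tuple[float, List[U]]:
    """Pick one candidate per position with the lowest total cost."""
    # costs[j]: cheapest total for any path ending at candidate j.
    costs = [target_cost(0, u) for u in candidates[0]]
    back: List[List[int]] = []  # best predecessor for each candidate
    for i in range(1, len(candidates)):
        prev = candidates[i - 1]
        new_costs, pointers = [], []
        for u in candidates[i]:
            k = min(range(len(prev)),
                    key=lambda j: costs[j] + join_cost(prev[j], u))
            new_costs.append(costs[k] + join_cost(prev[k], u)
                             + target_cost(i, u))
            pointers.append(k)
        costs = new_costs
        back.append(pointers)
    # Trace the cheapest path backward from the final position.
    j = min(range(len(costs)), key=costs.__getitem__)
    total, path = costs[j], [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return total, [candidates[i][j] for i, j in enumerate(path)]


# Toy problem: three positions with a rising pitch contour.
contour = [100.0, 120.0, 150.0]
options = [
    [(90.0, "a"), (105.0, "b")],
    [(100.0, "a"), (125.0, "b"), (140.0, "c")],
    [(120.0, "a"), (155.0, "c")],
]
total, chosen = select_units(
    options,
    target_cost=lambda i, u: abs(u[0] - contour[i]),
    # Samples cut from the same recording join without penalty.
    join_cost=lambda a, b: 0.0 if a[1] == b[1] else 4.0,
)

# Size of the full search space quoted in the text.
search_space = 10_000 ** 16
```

The search keeps only the cheapest way to reach each candidate at each position, so the work grows with the number of positions times the square of the candidates per position, rather than with the 10^64 complete sequences.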



3 Comments

  1. kadena56 11:54 PM 11/16/07

    timmy libila libila timmy

  2. metamorphmuses 11:48 PM 4/16/09

    This approach does seem to be getting closer to the "holy grail" of speech synthesis... closer to HAL in 2001 or the Enterprise computer in Star Trek: The Next Generation, and further away from the voice we all now associate with Stephen Hawking. (Yes, HAL is creepy for its manipulative, homicidal, misguided reasoning, but we can all agree it's what we'd like our computer to sound like when it talks to us.) However, I'd like to suggest SciAm fix the links in the article -- I'd really like to try the system out for myself.

  3. gigabetz 07:45 PM 4/23/09

    I find this article very amusing. I played with this on a computer as a kid and I tried to create British English and ended up adding extra vowels to create British sounding words. Now who has the accent?
