Software then converts the written text from a series of words into one of phonemes. The software notes features of interest about each phoneme, such as what phonemes preceded and followed it, or whether it is the first or last one in a sentence. It also identifies parts of speech such as nouns or verbs in the text. So for example, if the speaker read "Welcome to my home page," the program would translate it into something like
W · EH · L · K · UH · M · T · OO · M · I · H · OW · M · P · AY · J,
noting among other things that "page" is a noun, that the W was followed by an EH, and that the J was the final sound in the phrase.
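To make this step concrete, here is a minimal sketch in Python of how text might be turned into annotated phonemes. The tiny lexicon, the `PhonemeFeatures` record and the `annotate` function are all hypothetical names invented for illustration; a real system would rely on a full pronunciation dictionary and would also tag parts of speech, which this sketch leaves out.

```python
from dataclasses import dataclass
from typing import List, Optional

# Toy pronunciation lexicon (an assumption for illustration only).
# Symbols follow the informal notation used in the text.
LEXICON = {
    "welcome": ["W", "EH", "L", "K", "UH", "M"],
    "to": ["T", "OO"],
    "my": ["M", "I"],
    "home": ["H", "OW", "M"],
    "page": ["P", "AY", "J"],
}


@dataclass
class PhonemeFeatures:
    symbol: str
    prev: Optional[str]   # phoneme that preceded this one, if any
    next: Optional[str]   # phoneme that follows this one, if any
    is_first: bool        # first sound in the phrase
    is_last: bool         # final sound in the phrase
    word: str             # word the phoneme came from


def annotate(sentence: str) -> List[PhonemeFeatures]:
    """Convert a sentence into phonemes and note each one's context."""
    words = [w.strip('.,?!"').lower() for w in sentence.split()]
    flat = [(p, w) for w in words for p in LEXICON[w]]
    last = len(flat) - 1
    return [
        PhonemeFeatures(
            symbol=p,
            prev=flat[i - 1][0] if i > 0 else None,
            next=flat[i + 1][0] if i < last else None,
            is_first=(i == 0),
            is_last=(i == last),
            word=w,
        )
        for i, (p, w) in enumerate(flat)
    ]


features = annotate("Welcome to my home page,")
```

Each entry in `features` records the neighbors of one phoneme, so the program can later ask, for instance, which sound followed the W or whether the J closed the phrase.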
Once the text has been processed, it's time to examine our sound files. We measure them for three features: pitch, timing and loudness, collectively known as prosody. These parameters will help us to decide later on which example of each sound we want to use to synthesize a given phrase. Pitch, timing, and loudness are moving targets that change from moment to moment. You can think of the measurements as a string of annotations along the sound file.
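As a rough illustration of such a string of annotations, the Python sketch below cuts a signal into short frames and estimates pitch and loudness for each one. The function name `frame_prosody`, the 80 to 400 Hz search range and the autocorrelation pitch estimate are our own assumptions, chosen for brevity; they are not the measurement method the system actually uses.

```python
import math
from typing import List, Sequence, Tuple


def frame_prosody(
    samples: Sequence[float], sample_rate: int, frame_len: int
) -> List[Tuple[float, float, float]]:
    """Return (start_time_s, pitch_hz, rms_loudness) for each full frame.

    Pitch is a crude autocorrelation estimate and is only meaningful for
    voiced (periodic) frames; frame_len must exceed sample_rate // 80.
    """
    min_lag = sample_rate // 400  # highest pitch considered: 400 Hz
    max_lag = sample_rate // 80   # lowest pitch considered: 80 Hz
    track = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]

        # Loudness: root-mean-square amplitude of the frame.
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)

        # Pitch: the lag at which the frame best matches a shifted copy of itself.
        def autocorr(lag: int) -> float:
            return sum(frame[n] * frame[n + lag] for n in range(frame_len - lag))

        best_lag = max(range(min_lag, max_lag + 1), key=autocorr)
        track.append((start / sample_rate, sample_rate / best_lag, rms))
    return track


# Example input: a synthetic 200 Hz tone standing in for a recorded vowel.
SAMPLE_RATE = 8000
tone = [0.5 * math.sin(2 * math.pi * 200 * n / SAMPLE_RATE) for n in range(960)]
prosody_track = frame_prosody(tone, SAMPLE_RATE, frame_len=320)
```

Timing does not appear as a per-frame number here; it comes from where each phoneme starts and stops, which requires aligning the audio with the text.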
Next, using techniques borrowed from speech recognition (dictation programs that translate speech into text), the software correlates each recorded phoneme with its text counterpart. With the audio and the text aligned, we can look at a recorded voice file and pinpoint where each phoneme begins and ends. This is crucial; once we can locate and label the phonemes, our software can also precisely edit and catalogue them, and put them into a searchable database.
Our database contains an average of 10,000 recorded samples of each English phoneme. At first glance, that might appear to be a lot of redundancy. But the samples vary quite a lot because they were spoken with different pitches and came from different phonetic contexts. For example, let's look at one phoneme, the sound OO, as in "smooth." Some of the OO's in the database were originally followed by an L, as in "pool," while others were originally at the end of a word, as in "shampoo." These distinctions change the sound of the OO and therefore govern where we can use it later.
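One way to picture such a database is as a catalogue indexed by phoneme, with each sample tagged by the context it came from. The sketch below is a simplified assumption on our part: the `Sample` and `PhonemeDatabase` names are hypothetical, the time offsets are placeholders rather than real measurements, and a real sample would carry many more features (pitch, loudness, preceding sound and so on).

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Sample:
    phoneme: str
    next_phoneme: Optional[str]  # None means the sound ended its word
    source_word: str
    start_s: float               # where the sound begins in the recording
    end_s: float                 # where the sound ends in the recording


class PhonemeDatabase:
    """Catalogue of recorded samples, searchable by phoneme and context."""

    def __init__(self) -> None:
        self._by_phoneme: Dict[str, List[Sample]] = defaultdict(list)

    def add(self, sample: Sample) -> None:
        self._by_phoneme[sample.phoneme].append(sample)

    def find(self, phoneme: str, next_phoneme: Optional[str]) -> List[Sample]:
        """Return samples of `phoneme` that were followed by `next_phoneme`."""
        return [
            s for s in self._by_phoneme.get(phoneme, [])
            if s.next_phoneme == next_phoneme
        ]


db = PhonemeDatabase()
# Placeholder offsets, for illustration only.
db.add(Sample("OO", "TH", "smooth", 0.0, 0.0))
db.add(Sample("OO", "L", "pool", 0.0, 0.0))
db.add(Sample("OO", None, "shampoo", 0.0, 0.0))

oo_before_l = db.find("OO", "L")      # the OO from "pool"
oo_word_final = db.find("OO", None)   # the OO from "shampoo"
```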
It's one thing to have a recorded library of all the English phonemes, but when it comes to synthesizing an expressive, natural-sounding sentence, we need to determine what characteristics each chunk of speech should have. For example, a speaker typically slows down before a pause, such as when a comma appears in the text. So we need to take note of the longer durations of the sounds preceding commas. The pitch is also likely to be lower in the speech before commas. We use our speaker's database to build a statistical model that infers generalities about the rise and fall of pitch, duration and loudness for that person's speech. The statistical model learns these generalities automatically and will apply them later to make its synthetic speech sound more natural.
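The comma example can be sketched with about the simplest statistical model imaginable: an average duration for each combination of phoneme and context. We do not describe here which kind of model the system really uses, so treat this Python sketch as an assumption; the function name is hypothetical and the durations are invented purely for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple

# Invented observations, for illustration only:
# (phoneme, precedes_comma, duration in seconds)
OBSERVATIONS = [
    ("AY", True, 0.14),
    ("AY", True, 0.12),
    ("AY", False, 0.08),
    ("AY", False, 0.10),
]


def fit_duration_model(
    observations: Iterable[Tuple[str, bool, float]]
) -> Dict[Tuple[str, bool], float]:
    """Learn the average duration of each phoneme in each context."""
    grouped = defaultdict(list)
    for phoneme, before_comma, duration in observations:
        grouped[(phoneme, before_comma)].append(duration)
    return {key: mean(values) for key, values in grouped.items()}


model = fit_duration_model(OBSERVATIONS)

# The model has "learned" that this sound is longer before a comma.
lengthens_before_comma = model[("AY", True)] > model[("AY", False)]
```

A practical model would generalize across many features at once (position in the sentence, stress, part of speech) and would predict pitch and loudness as well as duration.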
Supervoices in Action
Now that we've "built" a system, let's put Supervoice to work. All of the processing Supervoice does occurs in milliseconds--fast enough that people can converse with the computer in real time. First we'll give it something to say, like "Can we have lunch today?" We have to convert the words into phonemes, the building blocks of Supervoices, which makes our sentence look like this:
K · AE · N · W · EE · H · AE · V · L · UH · N · CH · T · OO · D · AY
Supervoice notes features of interest about the phrase, among them that it is a question, that the third word is a verb, and that the second syllable of the last word is stressed.
We enter the features we noted into the statistical model. Based on those features, it sketches the pitch, timing and loudness contour the sentence should follow. For example, the model should notice that it's a yes/no question and specify a rising pitch at the end of the sentence. Equipped with this contour, we have only to look in our database to find phonemes that match the curve. We hang our phoneme samples on this figurative armature. Which phoneme sample should we choose to synthesize each part of the sentence? Our sentence contains 16 individual phonemes, with a staggering 10^64 (that is, 10,000^16) possible permutations, too many to consider. So we use a technique called dynamic programming to search the database efficiently and find the best fit.
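To show why dynamic programming tames that number, here is a small Python sketch of the idea. It is an assumption-laden toy, not the system's actual search: each candidate sample is reduced to a single pitch value, the "target cost" is its distance from the desired contour, the "join cost" is the pitch jump between neighboring samples, and the names `select_units` and `total_cost` are hypothetical.

```python
from typing import List, Sequence


def total_cost(path, targets, candidates, join_weight=1.0) -> float:
    """Cost of one complete choice of samples (one index per position)."""
    pitches = [candidates[i][j] for i, j in enumerate(path)]
    target = sum(abs(p - t) for p, t in zip(pitches, targets))
    joins = sum(abs(b - a) for a, b in zip(pitches, pitches[1:]))
    return target + join_weight * joins


def select_units(
    targets: Sequence[float],
    candidates: Sequence[Sequence[float]],
    join_weight: float = 1.0,
) -> List[int]:
    """Pick one candidate per position with the lowest total cost."""
    # best[i][j]: cheapest cost of any path that ends at candidate j of position i
    best = [[abs(c - targets[0]) for c in candidates[0]]]
    back = [[-1] * len(candidates[0])]
    for i in range(1, len(targets)):
        row, pointers = [], []
        for c in candidates[i]:
            # Cost of arriving at c from each candidate at the previous position.
            arrive = [
                best[i - 1][k] + join_weight * abs(c - prev)
                for k, prev in enumerate(candidates[i - 1])
            ]
            k_best = min(range(len(arrive)), key=arrive.__getitem__)
            row.append(arrive[k_best] + abs(c - targets[i]))
            pointers.append(k_best)
        best.append(row)
        back.append(pointers)

    # Trace the cheapest path backward from the final position.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = [j]
    for i in range(len(targets) - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    path.reverse()
    return path


# Invented pitch values (Hz), for illustration only: a rising contour.
target_contour = [100.0, 110.0, 130.0]
candidate_pitches = [[90.0, 105.0], [100.0, 112.0, 140.0], [120.0, 131.0]]
chosen = select_units(target_contour, candidate_pitches)
```

The saving comes from never revisiting a decision: at each position the search keeps only the cheapest way to reach each candidate. For 16 positions with 10,000 candidates apiece, that means roughly 15 × 10,000^2, or 1.5 billion, pairwise comparisons instead of 10^64 complete sequences.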


