Since losing the ability to speak in 2006, following a tracheostomy after cancer surgery, film critic Roger Ebert has communicated via Post-it notes, an eloquent and hilarious array of hand gestures, and his Mac laptop's speech synthesizer. The version that read out pre-typed introductions at his annual film festival in 2009 had an upper-class English accent the British might call "emollient." Ebert and his wife Chaz called it "Sir Laurence" and shortly thereafter replaced it with a more accessible American-accented voice called "Alex." By next year, Ebert may sound even more like himself, courtesy of personalized voice work being carried out by the Edinburgh-based company CereProc (short for cerebral processing and pronounced "serra-prock").
Ebert's extensive media recordings—not least the long-running TV series At the Movies—have led many people to suggest something like this. In his autobiography, Life Itself: A Memoir (Grand Central Publishing), due out September 13, Ebert says the cost was prohibitive until he discovered that CereProc, a specialist in regional accents, had built personalized voices for other individuals. Its Web versions of George W. Bush and Arnold Schwarzenegger, built from found audio samples, seemed promising.
The traditional method of constructing a speech synthesizer—unit selection—involves precisely transcribing hours of recordings and breaking them up into tiny pieces engineers call "phones" that can be stitched back together in different combinations. The joins aren't always smooth, however, creating audible artifacts.
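To make the idea concrete, here is a minimal sketch of concatenative unit selection: a database maps each phone to recorded snippets, and synthesis strings one snippet per phone together, cross-fading at each join to soften the audible artifacts described above. All names and data are illustrative toys, not CereProc's implementation; real systems search for the best-fitting unit rather than taking the first.

```python
# Toy unit-selection synthesis: pick a recorded unit per phone and
# concatenate, cross-fading at joins to reduce audible artifacts.

def crossfade(a, b, overlap=2):
    """Blend the tail of `a` into the head of `b` over `overlap` samples."""
    if overlap == 0 or not a or not b:
        return a + b
    head, tail = a[:-overlap], a[-overlap:]
    # linearly fade out the tail of `a` while fading in the head of `b`
    mixed = [
        t * (1 - i / overlap) + h * (i / overlap)
        for i, (t, h) in enumerate(zip(tail, b[:overlap]))
    ]
    return head + mixed + b[overlap:]

def synthesize(phones, unit_db):
    """Concatenate one recorded unit per phone, cross-fading each join."""
    out = []
    for p in phones:
        unit = unit_db[p][0]  # real systems search for the best-matching unit
        out = crossfade(out, unit) if out else list(unit)
    return out

# toy "recordings": one tiny sample array per phone
unit_db = {"HH": [[0.1, 0.2, 0.3]], "AY": [[0.9, 0.8, 0.7]]}
audio = synthesize(["HH", "AY"], unit_db)
```

The cross-fade is exactly the "join" Aylett describes: when adjacent units differ too much in pitch or loudness, no amount of blending hides the seam, which is why flatter, "more boring" recordings join more cleanly.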
"A lot of engineering work over the last 10 years is how to stop that artifact," says Matthew Aylett, CereProc's chief technical officer. "One way is to make the person speak in a more boring way—when there's less variation it's easier to join. So that has meant inevitably that within the traditional speech synthesis community the voices sound really boring." For reading out bank balances that suffices. But "if you want to read out whole paragraphs of text or longer pieces it can get very wearing," he adds.
CereProc's most intractable problem has been finding good audio. Unit selection's technical limitation is simple: garbage in, garbage out. Ebert talked plenty on his movie review programs, but was frequently interrupted and usually had a film playing on a screen behind him. The original tracks of his DVD commentaries were better, but his excitement and engagement made large parts unusable.
"It would have been easier if he had been more boring and stupid," Aylett says. Other technical difficulties stemmed from differing microphones, equipment and room sounds. "You could hear the change mid-sentence in the first version."
In the future CereProc wants to make personalized synthesizers scalable—that is, to automate their creation. A newer approach, the HMM-based Speech Synthesis System (HTS), builds a statistical model of captured sounds over time and then inverts that model to produce speech. Aylett compares the process to rendering graphics.
HTS has several advantages. It is more tolerant of noise and transcription errors, and requires less input.
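The statistical approach can be caricatured in a few lines: instead of storing recordings, the model stores per-phone averages of acoustic features learned from many examples, and "inverting" the model means emitting the modeled trajectory. Averaging over noisy takes is precisely why this approach tolerates noise and transcription errors better than unit selection. This is a deliberately simplified illustration, not the actual HTS algorithm, which models full distributions and dynamics.

```python
# Toy statistical synthesis: learn a mean acoustic feature per phone,
# then generate speech by emitting that mean as a flat trajectory.
from collections import defaultdict

def train(corpus):
    """corpus: list of (phone, feature_frames) pairs; returns mean frame per phone."""
    sums, counts = defaultdict(float), defaultdict(int)
    for phone, frames in corpus:
        for f in frames:
            sums[phone] += f
            counts[phone] += 1
    return {p: sums[p] / counts[p] for p in sums}

def generate(phones, model, frames_per_phone=3):
    """Emit the modeled mean feature, repeated for each phone's duration."""
    return [model[p] for p in phones for _ in range(frames_per_phone)]

# two noisy takes of the same phone average out in training
corpus = [("AY", [0.9, 1.1]), ("AY", [1.0, 1.0])]
model = train(corpus)
trajectory = generate(["AY"], model)
```

The trade-off is audible: because output comes from a smoothed model rather than real recordings, it sounds more processed—the "1990s" quality Aylett mentions below.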
"The problem at the moment with this system is that the output sounds a bit more like synthesis sounded in the 1990s," Aylett says. But he thinks voice building must be made more efficient. "We want to do a Web service where people can record their voice and wind up with a voice automatically," he says. The audio quality won't be as good, but for most purposes it just needs to be understandable.