Getting Voice: New Speech Synthesis Could Make Roger Ebert Sound More Like Himself

The approach to create a more authentic voice for the film critic will attempt to blend two processes: unit selection and the Hidden Markov Model Speech Synthesis System

By Wendy M. Grossman

Since Roger Ebert lost the ability to speak in 2006 due to a post-cancer surgery tracheostomy, the film critic has communicated via Post-It notes, an eloquent and hilarious array of hand gestures, and his Mac laptop synthesizer. The version that read out pre-typed introductions at his annual film festival in 2009 had an upper-class English accent the British might call "emollient." Ebert and his wife Chaz called it "Sir Laurence" and shortly thereafter replaced it with a more accessible American-accented voice called "Alex." By next year, Ebert may sound even more like himself, courtesy of personalized voice work being carried out by the Edinburgh-based company CereProc (short for cerebral processing and pronounced "serra-prock").

Ebert's extensive media recordings, not the least of which is the long-running TV series At the Movies, had led many people to suggest something like this. In his autobiography, Life Itself: A Memoir (Grand Central Publishing), set for release on September 13, Ebert says the cost was prohibitive until he discovered that CereProc, a specialist in regional accents, had built personalized voices for other individuals. Its Web versions of George W. Bush and Arnold Schwarzenegger, built from found audio samples, seemed promising.

The traditional method of constructing a speech synthesizer—unit selection—involves precisely transcribing hours of recordings and breaking them up into tiny pieces engineers call "phones" that can be stitched back together in different combinations. The joins aren't always smooth, however, creating audible artifacts.
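The stitching step described above can be pictured with a toy sketch. It is not CereProc's method: it assumes each recorded unit is summarized by just two invented numbers (the pitch at its start and at its end), and it treats the "join cost" as the pitch jump between neighboring units. Real systems also weigh how well each unit matches the target sound. The function name `select_units` and all values are hypothetical.

```python
# Toy unit selection: choose one recorded unit per phone so the joins are smooth.
# Each unit is (start_pitch_hz, end_pitch_hz); all numbers are invented.

def select_units(candidates):
    """candidates: one list of (start, end) units per target phone.
    Returns (total_join_cost, chosen_unit_indices)."""
    # best[j] holds the cheapest path that ends in unit j of the current phone.
    best = [(0.0, [j]) for j in range(len(candidates[0]))]
    for pos in range(1, len(candidates)):
        new_best = []
        for j, (start, _end) in enumerate(candidates[pos]):
            options = []
            for i, (cost, path) in enumerate(best):
                prev_end = candidates[pos - 1][i][1]
                # Join cost: size of the pitch jump at the boundary.
                options.append((cost + abs(prev_end - start), path + [j]))
            new_best.append(min(options))
        best = new_best
    return min(best)


candidates = [
    [(100, 120), (100, 140)],  # two recordings of the first phone
    [(125, 130), (150, 160)],  # two recordings of the second phone
    [(128, 110), (165, 150)],  # two recordings of the third phone
]
total_cost, chosen = select_units(candidates)
print(total_cost, chosen)  # 7.0 [0, 0, 0]
```

The larger the pitch jump at a boundary, the more audible the join. That is why flatter, less varied recordings are easier to stitch together.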

"A lot of engineering work over the last 10 years is how to stop that artifact," says Matthew Aylett, CereProc's chief technical officer. "One way is to make the person speak in a more boring way—when there's less variation it's easier to join. So that has meant inevitably that within the traditional speech synthesis community the voices sound really boring." For reading out bank balances that suffices. But "if you want to read out whole paragraphs of text or longer pieces it can get very wearing," he adds.

CereProc's most intractable problem has been finding good audio. Unit selection's technical limitation is simple: garbage in, garbage out. Ebert talked plenty on his movie review programs, but was frequently interrupted and usually had a film playing on a screen behind him. The original tracks of his DVD commentaries were better, but his excitement and engagement made large parts unusable.

"It would have been easier if he had been more boring and stupid," Aylett says. Other technical difficulties stemmed from differing microphones, equipment and room sounds. "You could hear the change mid-sentence in the first version."

In the future CereProc wants to make personalized synthesizers scalable—that is, to automate their creation. A newer approach, called the Hidden Markov Model Speech Synthesis System (HTS), creates a statistical model of captured sounds over time and then inverts it to produce speech. Aylett compares the process to rendering graphics.

HTS has several advantages. It is more tolerant of noise and transcription errors, and requires less input.
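A heavily simplified sketch of the statistical idea: instead of storing recordings, keep only summary statistics per sound and generate output from them. It assumes a single invented feature (pitch) and a plain average per phone. Actual HTS uses hidden Markov models over many acoustic features, so the functions `train` and `generate` here are illustrative assumptions, not the real system's interface.

```python
from statistics import mean

def train(samples):
    """samples: list of (phone, pitch_hz) observations.
    Returns a model mapping each phone to its average pitch."""
    by_phone = {}
    for phone, pitch in samples:
        by_phone.setdefault(phone, []).append(pitch)
    return {phone: mean(values) for phone, values in by_phone.items()}

def generate(model, phones):
    """'Invert' the model: emit the expected pitch for each requested phone."""
    return [model[phone] for phone in phones]


# Invented observations; one noisy sample barely shifts the average.
samples = [("a", 100), ("a", 110), ("b", 200)]
model = train(samples)
print(generate(model, ["b", "a"]))  # [200, 105]
```

Because each output is drawn from an average rather than from one particular recording, a single bad sample has limited influence. The same averaging is one intuition for why the result can sound smoother but less natural.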

"The problem at the moment with this system is that the output sounds a bit more like synthesis sounded in the 1990s," Aylett says. But he thinks voice building must be made more efficient. "We want to do a Web service where people can record their voice and wind up with a voice automatically," he says. The audio quality won't be as good, but for most purposes it just needs to be understandable.

Ebert, however, would like broadcast quality, a tougher challenge that is spurring CereProc to consider a hybrid approach that uses the HTS model to select among stored phones, generating only less-common sounds that are missing or poorly represented in the database.
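One way to picture such a hybrid, as a sketch under stated assumptions rather than CereProc's design: a statistical model predicts a target value for each phone, the stored recording closest to that target is used when enough recordings exist, and the model's own prediction is used otherwise. The function `synthesize`, the `min_count` threshold and all data are hypothetical.

```python
def synthesize(phones, database, model, min_count=2):
    """database: {phone: [pitch of each stored recording]}
    model: {phone: pitch predicted by the statistical model}
    Returns a list of (source, pitch) pairs, one per phone."""
    output = []
    for phone in phones:
        units = database.get(phone, [])
        if len(units) >= min_count:
            # Well represented: let the model choose among stored recordings.
            closest = min(units, key=lambda unit: abs(unit - model[phone]))
            output.append(("stored", closest))
        else:
            # Missing or poorly represented: fall back to generated sound.
            output.append(("generated", model[phone]))
    return output


database = {"a": [100, 118, 104], "zh": [90]}  # invented recordings
model = {"a": 105, "zh": 95.0}                 # invented model predictions
result = synthesize(["a", "zh"], database, model)
print(result)  # [('stored', 104), ('generated', 95.0)]
```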

"It's great to have someone prominent like this [as a test case]; it moves the technology forward for us and makes it more obvious to other people that it can be done," Aylett says. His inner engineer has been piqued: "I just want to solve this problem." A small sample of the work in progress debuted on Oprah last year, but the date for a finished version is still uncertain.

The time it takes to type out speech will still hamper real-time conversation. Says Aylett, "It gives you real humility as an engineer when you realize that what you're competing against is a Post-It note."

A final question will only be answered when Ebert's new voice goes into service. Will it trigger "the uncanny valley," that is, the revulsion people feel toward robots that seem almost, but not quite, human?

"I doubt if it will ever be a problem," Ebert says via e-mail, "but if it is, it's one I'd like to have."
