Talk to the Machine: Progress in Speech-Recognition Software, by David Pogue

Speech-recognition programs are no longer clumsy exercises in futility

In the past couple of years speech-recognition software has quietly grown tendrils into every corner of our lives. It’s at the other end of customer-support hotlines and airline reservation systems. It’s built into Microsoft Windows. It’s an alternative text-input method for touch-screen phones such as the iPhone and Android models. But let’s face it: most people who use this software wish they didn’t have to.

That’s because speech recognition is usually plan B: a least terrible alternative to typing or actual human conversation. Corporations use it for their phone systems because it’s cheaper than hiring real people. Many people who dictate into their computers do it because they must, perhaps because of a disability. And speech recognition is cropping up on touch-screen phones because typing on an on-screen keyboard is slow and fussy.

So what would it take to make speech recognition more than a work-around? How close are we to the Star Trek ideal of conversational computers that never get it wrong?

Well, we’re getting there. It turns out that after a decade of buyouts, mergers and embezzlement scandals, there is only one major speech-recognition company left: Nuance Communications. It sells the only commercial dictation software for Windows, for Macintosh and for iPhone. Its technology drives the voice-command systems in cars from Audi, BMW, Ford and Mercedes and cell phones from Motorola, Nokia, Samsung, Verizon and T-Mobile. It powers voice-activated toys, GPS units and cash machines, and it answers the phone at AT&T, Bank of America, CVS and many others.

Every year Nuance releases another new version of its consumer dictation programs, such as Dragon NaturallySpeaking. Usually it doesn’t add many new features. Instead it devotes most of its resources to a single goal: improving accuracy.

In the beginning, you had to train these programs by reading a 45-minute script into your microphone so that the program could learn your voice. As the technology improved over the years, that training session fell to 20 minutes, to 10, to five—and now you don’t have to train the software at all. You just start dictating, and you get (by my testing) 99.9 percent accuracy. That’s still one word wrong every couple of pages, but it’s impressive.
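To see where "one word wrong every couple of pages" comes from, the arithmetic can be sketched in a few lines of Python. The figure of 500 words per page is an assumption for illustration, not a number from my testing, and the function names are made up for this sketch.

```python
WORDS_PER_PAGE = 500  # assumption: a rough figure for a typed page


def words_between_errors(accuracy_percent: float) -> float:
    """Average number of dictated words per misrecognized word."""
    if not 0 <= accuracy_percent < 100:
        raise ValueError("accuracy must be at least 0 and below 100 percent")
    # An accuracy of 99.9 percent means 0.1 words in every 100 are wrong.
    return 100.0 / (100.0 - accuracy_percent)


def pages_between_errors(accuracy_percent: float,
                         words_per_page: int = WORDS_PER_PAGE) -> float:
    """Average number of pages dictated per misrecognized word."""
    return words_between_errors(accuracy_percent) / words_per_page


if __name__ == "__main__":
    print(round(words_between_errors(99.9)))     # about 1,000 words per error
    print(round(pages_between_errors(99.9), 1))  # about 2 pages per error
```

Under that assumption, 99.9 percent accuracy works out to roughly one mistake per 1,000 words, or about one every two pages.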

Speech engineers use all kinds of tricks to boost accuracy. The earliest dictation programs required you to pause after each word; the software had no clue how to distinguish “their” from “there” and “they’re.” But in time, ever more powerful PC processors made continuous-speech analysis possible. Today you are encouraged to speak in longer phrases, so the software has more context to analyze for accuracy.
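Here is a toy sketch of why context helps. It picks among the three sound-alikes by checking which one most often precedes the next word. The word-pair counts are invented for illustration, and nothing here is meant to describe how Nuance's software actually works; real recognizers use far richer statistical models.

```python
# The three words sound identical, so audio alone cannot separate them.
HOMOPHONES = ["their", "there", "they're"]

# Hypothetical counts of how often each word pair shows up in ordinary text.
# These numbers are made up for the example.
PAIR_COUNTS = {
    ("their", "house"): 50,
    ("there", "house"): 1,
    ("there", "is"): 60,
    ("their", "is"): 1,
    ("they're", "going"): 40,
    ("their", "going"): 2,
    ("there", "going"): 3,
}


def pick_homophone(next_word: str) -> str:
    """Choose the sound-alike that most often comes before next_word."""
    return max(HOMOPHONES, key=lambda w: PAIR_COUNTS.get((w, next_word), 0))


if __name__ == "__main__":
    for following in ("house", "is", "going"):
        print(pick_homophone(following), following)
```

When the program hears one word at a time, it has no following word to consult and can only guess. Given a whole phrase, even this crude lookup chooses "their house," "there is" and "they're going."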

Another trick: Last year Nuance offered a free dictation app for the iPhone, called Dragon Dictation. What you say is transmitted to the company’s servers, where it is analyzed, converted to text and zapped back to your screen within seconds.

What nobody knew, though, is that the company stored all those millions of speech samples, in effect creating an immense storehouse of different voices, ages, inflections and accents against which to test different recognition algorithms.

So, yes, the technology is improving. But readers often ask me: “If dictation software is so good, can I use it to transcribe phone calls and interviews?”

The answer is still no. The software isn’t much good unless you are speaking into a microphone, without background noise, preferably without an accent. You still have to speak all punctuation (“comma”), like this (“period”). And goodness knows, we humans have enough trouble understanding each other; it’s a bit much to ask for a computer to get it all right. No wonder today’s dictation apps still make mistakes such as “mode import” for “modem port,” “move eclipse” for “movie clips,” and “oak wrap” for—well, you get it.

So, no, the keyboard isn’t going away in our lifetime. Conversational-style Star Trek computing is still decades away. Sure, 99.9 percent accuracy is darned good—but until it reaches 100, speech-recognition technology is still plan B.
