Scientific American Magazine, December 2010

Talk to the Machine: Progress in Speech-Recognition Software, by David Pogue

Speech-recognition programs are no longer clumsy exercises in futility

Image: Illustration by Chris Whetzel

In the past couple of years speech-recognition software has quietly grown tendrils into every corner of our lives. It’s at the other end of customer-support hotlines and airline reservation systems. It’s built into Microsoft Windows. It’s an alternative text-input method for touch-screen phones such as the iPhone and Android handsets. But let’s face it: most people who use this software wish they didn’t have to.

That’s because speech recognition is usually plan B: a least terrible alternative to typing or actual human conversation. Corporations use it for their phone systems because it’s cheaper than hiring real people. Many people who dictate into their computers do it because they must, perhaps because of a disability. And speech recognition is cropping up on touch-screen phones because typing on an on-screen keyboard is slow and fussy.

So what would it take to make speech recognition more than a work-around? How close are we to the Star Trek ideal of conversational computers that never get it wrong?

Well, we’re getting there. It turns out that after a decade of buyouts, mergers and embezzlement scandals, there is only one major speech-recognition company left: Nuance Communications. It sells the only commercial dictation software for Windows, for Macintosh and for iPhone. Its technology drives the voice-command systems in cars from Audi, BMW, Ford and Mercedes and cell phones from Motorola, Nokia, Samsung, Verizon and T-Mobile. It powers voice-activated toys, GPS units and cash machines, and it answers the phone at AT&T, Bank of America, CVS and many others.

Every year Nuance releases another new version of its consumer dictation programs, such as Dragon NaturallySpeaking. Usually it doesn’t add many new features. Instead it devotes most of its resources to a single goal: improving accuracy.

In the beginning, you had to train these programs by reading a 45-minute script into your microphone so that the program could learn your voice. As the technology improved over the years, that training session fell to 20 minutes, to 10, to five—and now you don’t have to train the software at all. You just start dictating, and you get (by my testing) 99.9 percent accuracy. That’s still one word wrong every couple of pages, but it’s impressive.
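To see where "one word wrong every couple of pages" comes from, here is a rough back-of-the-envelope sketch. The 500-words-per-page figure is my own assumption for illustration, not a number from the testing above, and the function names are made up for this example.

```python
# Back-of-the-envelope check: how often does an error show up at a given accuracy?
# ASSUMPTION: a page holds about 500 words (illustrative, not from the article).

def words_per_error(accuracy_percent: float) -> float:
    """Average number of words dictated for each misrecognized word."""
    return 100.0 / (100.0 - accuracy_percent)


def pages_per_error(accuracy_percent: float, words_per_page: int = 500) -> float:
    """Average number of pages dictated for each misrecognized word."""
    return words_per_error(accuracy_percent) / words_per_page


if __name__ == "__main__":
    print(round(words_per_error(99.9)))      # roughly 1000 words per mistake
    print(round(pages_per_error(99.9), 2))   # roughly 2 pages per mistake
```

At 99 percent accuracy the same arithmetic gives about five mistakes per page, which is why that last decimal place matters so much.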

Speech engineers use all kinds of tricks to boost accuracy. The earliest dictation programs required you to pause after each word; the software had no clue how to distinguish “their” from “there” and “they’re.” But in time, ever more powerful PC processors made continuous-speech analysis possible. Today you are encouraged to speak in longer phrases, so the software has more context to analyze for accuracy.
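Why does context help? A toy sketch makes the idea concrete. It counts which word pairs appear in a tiny, made-up training text and then picks whichever homophone fits best between its neighbors. This is only an illustration of the general principle; it is not how Dragon NaturallySpeaking actually works, and the training text, names and scoring rule are all assumptions of mine.

```python
from collections import Counter

# The three words that sound alike and cannot be told apart by audio alone.
HOMOPHONES = {"their", "there", "they're"}

# ASSUMPTION: a tiny invented training text. A real system learns from vastly more data.
TRAINING_TEXT = (
    "they left their coats over there . "
    "their house is over there . "
    "they're going to their house . "
    "they're late again ."
)


def train_bigrams(text: str) -> Counter:
    """Count how often each pair of adjacent words occurs in the text."""
    words = text.split()
    return Counter(zip(words, words[1:]))


def pick_homophone(prev_word: str, next_word: str, bigrams: Counter) -> str:
    """Choose the homophone seen most often next to these neighbors."""
    def score(candidate: str) -> int:
        return bigrams[(prev_word, candidate)] + bigrams[(candidate, next_word)]

    # sorted() makes tie-breaking deterministic.
    return max(sorted(HOMOPHONES), key=score)


BIGRAMS = train_bigrams(TRAINING_TEXT)

if __name__ == "__main__":
    print(pick_homophone("over", ".", BIGRAMS))    # neighbors: "over ___ ."
    print(pick_homophone("to", "house", BIGRAMS))  # neighbors: "to ___ house"
```

With no neighbors to look at, as in the pause-after-every-word days, every candidate scores the same and the program can only guess.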

Another trick: Last year Nuance offered a free dictation app for the iPhone, called Dragon Dictation. What you say is transmitted to the company’s servers, where it is analyzed, converted to text and zapped back to your screen within seconds.

What nobody knew, though, is that the company stored all those millions of speech samples, in effect creating an immense storehouse of different voices, ages, inflections and accents against which to test different recognition algorithms.
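Testing recognition algorithms against a storehouse of samples needs a yardstick. A common one in the speech field is word error rate: the number of word substitutions, insertions and deletions needed to turn the program's output into the correct transcript, divided by the length of the correct transcript. The sketch below is a generic version of that measure; I am not claiming it is the metric or the code Nuance uses, and the function name is my own.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between two transcripts, divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


if __name__ == "__main__":
    correct = "plug the cable into the modem port"
    heard = "plug the cable into the mode import"
    print(word_error_rate(correct, heard))  # two wrong words out of seven
```

Run two candidate algorithms over the same set of recordings, average the scores, and the one with the lower number wins.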

So, yes, the technology is improving. But readers often ask me: “If dictation software is so good, can I use it to transcribe phone calls and interviews?”

The answer is still no. The software isn’t much good unless you are speaking into a microphone, without background noise, preferably without an accent. You still have to speak all punctuation (“comma”), like this (“period”). And goodness knows, we humans have enough trouble understanding each other; it’s a bit much to ask for a computer to get it all right. No wonder today’s dictation apps still make mistakes such as “mode import” for “modem port,” “move eclipse” for “movie clips,” and “oak wrap” for, well, you get it.
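Spoken punctuation is the easy part for the software. A minimal sketch of the idea, assuming a simple lookup table of command words that I invented for illustration (real dictation programs handle many more commands and edge cases):

```python
# ASSUMPTION: a small illustrative table of spoken commands, not any product's actual list.
SPOKEN_PUNCTUATION = {
    "comma": ",",
    "period": ".",
    "colon": ":",
    "semicolon": ";",
}


def apply_spoken_punctuation(transcript: str) -> str:
    """Replace spoken punctuation words with marks attached to the preceding word."""
    pieces = []
    for word in transcript.split():
        mark = SPOKEN_PUNCTUATION.get(word.lower())
        if mark and pieces:
            pieces[-1] += mark  # glue the mark onto the word before it
        else:
            pieces.append(word)  # ordinary word (or a command with nothing before it)
    return " ".join(pieces)


if __name__ == "__main__":
    spoken = "you still have to speak all punctuation comma like this period"
    print(apply_spoken_punctuation(spoken))
```

Notice what this naive version cannot do: if you actually want to write about "a period of time," it will happily turn the word into a dot. Telling those cases apart takes context again.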



7 Comments

  1. gesimsek 07:12 PM 11/18/10

    I think the real problem is related to the concept of "emergence". Language is an area where meaning emerges out of separate words. As we lawyers well know, the context and connotations define the meaning you create, and a sentence is more than its constituent parts. Computer programmers need to solve this emergence problem in their own way.

  2. gordienj 01:07 PM 12/5/10

  3. gordienj 01:33 PM 12/5/10

    I think it is amazing that people get results as good as they do dictating English, with all of its linguistic clumsiness. The amount of research hours on the subject is formidable; the function of language must be well known. Articles of this nature usually make me wish I could be part of an effort to create a language that would be more machinable and easier to use than current languages that have developed organically. I have examined other ways to write that indicate the sound of the words in the characters (and get rid of the dots and dashes that make one have to lift their pen while writing). Frequently used long words would get fewer syllables - the speed of speech and writing could double. I'm sure this has already been done, but I don't have access to or knowledge of any project like this; I would like to though.

    Thanks to Mr. Pogue. I will try out some of his tips.
    Sincerely, Gordon Hoffman

  4. coop125 09:56 AM 12/20/10

    I'm reminded of the fact that Victor Borge mastered the art of dictating punctuation for speech recognition years ago. http://youtu.be/uUm787cz460

  5. Hermit 04:24 PM 12/20/10

    I'm looking for a light-weight screen/pad that I can hang from around my neck for a closed-captioned effect when talking to a deaf friend. I can't use sign-language because my hands are impaired with MS.

    Anyone done this?

  6. bucketofsquid in reply to gordienj 09:59 AM 12/22/10

    Isn't that what LogLan is supposed to accomplish?

  7. JDahiya in reply to bucketofsquid 07:49 AM 1/18/11

    No, that's Ithkuil, not Loglan.
