
Image: Illustration by Chris Whetzel
In the past couple of years speech-recognition software has quietly grown tendrils into every corner of our lives. It’s at the other end of customer-support hotlines and airline reservation systems. It’s built into Microsoft Windows. It’s an alternative text-input method for touch-screen phones such as the iPhone and Android handsets. But let’s face it: most people who use this software wish they didn’t have to.
That’s because speech recognition is usually plan B: a least terrible alternative to typing or actual human conversation. Corporations use it for their phone systems because it’s cheaper than hiring real people. Many people who dictate into their computers do it because they must, perhaps because of a disability. And speech recognition is cropping up on touch-screen phones because typing on an on-screen keyboard is slow and fussy.
So what would it take to make speech recognition more than a work-around? How close are we to the Star Trek ideal of conversational computers that never get it wrong?
Well, we’re getting there. It turns out that after a decade of buyouts, mergers and embezzlement scandals, there is only one major speech-recognition company left: Nuance Communications. It sells the only commercial dictation software for Windows, for Macintosh and for iPhone. Its technology drives the voice-command systems in cars from Audi, BMW, Ford and Mercedes and cell phones from Motorola, Nokia, Samsung, Verizon and T-Mobile. It powers voice-activated toys, GPS units and cash machines, and it answers the phone at AT&T, Bank of America, CVS and many others.
Every year Nuance releases another new version of its consumer dictation programs, such as Dragon NaturallySpeaking. Usually it doesn’t add many new features. Instead it devotes most of its resources to a single goal: improving accuracy.
In the beginning, you had to train these programs by reading a 45-minute script into your microphone so that the program could learn your voice. As the technology improved over the years, that training session fell to 20 minutes, to 10, to five—and now you don’t have to train the software at all. You just start dictating, and you get (by my testing) 99.9 percent accuracy. That’s still one word wrong every couple of pages, but it’s impressive.
Speech engineers use all kinds of tricks to boost accuracy. The earliest dictation programs required you to pause after each word; the software had no clue how to distinguish “their” from “there” and “they’re.” But in time, ever more powerful PC processors made continuous-speech analysis possible. Today you are encouraged to speak in longer phrases, so the software has more context to analyze for accuracy.
Another trick: Last year Nuance offered a free dictation app for the iPhone, called Dragon Dictation. What you say is transmitted to the company’s servers, where it is analyzed, converted to text and zapped back to your screen within seconds.
What nobody knew, though, is that the company stored all those millions of speech samples, in effect creating an immense storehouse of different voices, ages, inflections and accents against which to test different recognition algorithms.
So, yes, the technology is improving. But readers often ask me: “If dictation software is so good, can I use it to transcribe phone calls and interviews?”
The answer is still no. The software isn’t much good unless you are speaking into a microphone, without background noise, preferably without an accent. You still have to speak all punctuation (“comma”), like this (“period”). And goodness knows, we humans have enough trouble understanding each other; it’s a bit much to ask for a computer to get it all right. No wonder today’s dictation apps still make mistakes such as “mode import” for “modem port,” “move eclipse” for “movie clips,” and “oak wrap” for—well, you get it.




7 Comments
I think the real problem is related to the concept of "emergence". Language is an area where meaning emerges out of separate words. As we lawyers well know, the context and connotations define the meaning you create, and a sentence is more than its constituent parts. Computer programmers need to solve this emergence problem in their own way.
I think it is amazing that people get results as good as they do dictating English, with all of its linguistic clumsiness. The amount of research hours on the subject is formidable; the function of language must be well known. Articles of this nature usually make me wish I could be part of an effort to create a language that would be more machinable, and easier to use, than current languages that have developed organically. I have examined other ways to write that indicate the sound of the words in the characters (and that get rid of the dots and dashes that make writers lift their pens while they are writing). Frequently used long words would get fewer syllables, so the speed of speech and writing could double. I'm sure this has already been done, but I don't have access to or knowledge of any project like this; I would like to, though.
Thanks to Mr. Pogue. I will try out some of his tips.
Sincerely, Gordon Hoffman
I'm reminded of the fact that Victor Borge mastered the art of dictating punctuation for speech recognition years ago. http://youtu.be/uUm787cz460
I'm looking for a light-weight screen/pad that I can hang from around my neck for a closed-captioned effect when talking to a deaf friend. I can't use sign-language because my hands are impaired with MS.
Anyone done this?
Isn't that what LogLan is supposed to accomplish?
No, that's Ithkuil, not Loglan.