When Will Speech-Recognition Software Finally Be Good Enough?

Think how much time we’d save if voice assistants always understood commands or questions the first time

Back in 2010 Matt Thompson, then with National Public Radio, forecast in an op-ed that “at some point in the near future, automatic speech transcription will become fast, free, and decent.” He called that moment the “Speakularity,” in a sly reference to inventor Ray Kurzweil's vision of the “singularity,” in which our minds will be uploaded into computers. And Thompson predicted that access to reliable automatic speech-recognition (ASR) software would transform the work of journalists—to say nothing of lawyers, marketers, people with hearing disabilities, and everyone else who deals in both spoken and written language.

Desperate for any technology that would free me from the exhausting process of typing real-time notes during interviews, I was enraptured by Thompson's prediction. But while his brilliant career in radio has continued (he is now editor in chief of the Center for Investigative Reporting's news output, including its show Reveal), the Speakularity seems as far away as ever.

There has been important progress, to be sure. Several start-ups, such as Otter, Sonix, Temi and Trint, offer online services that allow customers to upload digital audio files and, minutes later, receive computer-generated transcripts. In my life as an audio producer, I use these services every day. Their speed keeps increasing, and their cost keeps going down, which is welcome.

But accuracy is another matter. In 2016 a team at Microsoft Research announced that it had trained its machine-learning algorithms to transcribe speech from a standard corpus of recordings with record-high 94 percent accuracy. Professional human transcriptionists performed no better than the program in Microsoft's tests, which led media outlets to celebrate the arrival of “parity” between humans and software in speech recognition.

The thing is, that last 6 percent makes all the difference. I can tell you from bitter experience that cleaning up a transcript that is 94 percent accurate can take almost as long as transcribing the audio manually. And four years after that breakthrough, services such as Temi still claim no better than 95 percent—and then only for recordings of clear, unaccented speech.
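(The "94 percent accuracy" in such benchmarks corresponds to a word error rate, or WER, of 6 percent — the standard ASR metric, computed as the word-level edit distance between the machine transcript and a human reference. As a rough illustration — not any vendor's actual scoring code — WER can be sketched in a few lines of Python:)

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER: word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words (Levenshtein distance)
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,       # deletion
                           dp[i][j - 1] + 1,       # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# One wrong word in a 20-word sentence is a WER of 0.05 -- i.e., "95 percent accuracy."
```

(By this measure, a transcript with one error in every 20 words — exactly the level today's services advertise — scores 95 percent.)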

Why is accuracy so important? Well, to take one example, more and more audio producers (including myself) are complying with Internet accessibility guidelines by publishing transcripts of their podcasts—and no one wants to share a transcript in which one in every 20 words contains an error. And think how much time people could save if voice assistants such as Alexa, Bixby, Cortana, Google Assistant and Siri understood every question or command the first time.

ASR systems may never reach 100 percent accuracy. After all, humans do not always speak fluently, even in their native languages. And speech is so full of homophones that comprehension always depends on context. (I have seen transcription services render “iOS” as “ayahuasca” and “your podcast” as “your punk ass.”)

But all I am asking for is a 1 or 2 percent improvement in accuracy. In machine learning, one of the main ways to reduce an algorithm's error rate is to feed it higher-quality training data. It is going to be crucial, therefore, for transcription services to figure out privacy-friendly ways of gathering more such data. Every time I clean up a Trint or Sonix transcript, for example, I am generating new, validated data that could be matched to the original audio and used to improve the models. I would be happy to let the companies use it if it meant there would be fewer errors over time.

Getting such data is surely one path to the Speakularity. Given the growing number of conversations we have with our machines and the increasing amount of audio created every day, we should not be thinking of decent automatic transcription as a luxury or an aspiration anymore. It is an absolute necessity.

Wade Roush is the host and producer of Soonish, a podcast about technology, culture, curiosity and the future. He is a co-founder of the podcast collective Hub & Spoke and a freelance reporter for print, online and radio outlets, such as MIT Technology Review, Xconomy, WBUR and WHYY. His new book, Extraterrestrials, is published by the MIT Press.

This article was published with the title “Seeking Software That Hears Better” in Scientific American Magazine Vol. 322 No. 5 (May 2020), p. 24
doi:10.1038/scientificamerican0520-24
