How Speech-Recognition Software Discriminates against Minority Voices

Until programmers recognize their own internal biases, the software they create will be problematic

“Clow-dia,” I say once. Twice. A third time. Defeated, I say the Americanized version of my name: “Claw-dee-ah.” Finally, Siri recognizes it. Having to adapt our way of speaking to interact with speech-recognition technologies is a familiar experience for people whose first language is not English or who do not have conventionally American-sounding names. I have now stopped using Siri, Apple's voice-based virtual assistant, because of it.

The growth of this tech in the past decade—not just Siri but Alexa and Cortana and others—has unveiled a problem in it: racial bias. One recent study, published in the Proceedings of the National Academy of Sciences USA, showed that speech-recognition programs are biased against Black speakers. On average, the authors found, all five programs from leading technology companies, including Apple and Microsoft, showed significant race disparities; they were roughly twice as likely to incorrectly transcribe audio from Black speakers compared with white speakers.

This effectively censors voices that are not part of the “standard” languages or accents used to create these technologies. “I don't get to negotiate with these devices unless I adapt my language patterns,” says Halcyon Lawrence, an assistant professor of technical communication and information design at Towson University, who was not part of the study. “That is problematic.” For Lawrence, who has a Trinidad and Tobagonian accent, or for me as a Puerto Rican, part of our identity comes from speaking a particular language, having an accent or using a set of speech forms such as African American Vernacular English (AAVE). Having to change such an integral part of an identity to be able to be recognized is inherently cruel.

The inability to be understood impacts other marginalized communities, such as people with visual or movement disabilities who rely on voice recognition and speech-to-text tools, says Allison Koenecke, a computational graduate student and first author of the PNAS study. For someone with a disability who is dependent on these technologies, being misunderstood could have serious consequences. There are probably many culprits for these disparities, but Koenecke points to the most likely: the data used for training, which are predominantly from white, native speakers of American English. By using databases that are narrow both in the words that are used and how they are said, training systems exclude accents and other ways of speaking that have unique linguistic features. Humans, presumably including those who create these technologies, have accent and language biases. For example, research shows that the presence of an accent affects whether jurors find people guilty and whether patients find their doctors competent.

Recognizing these biases would be an important way to avoid implementing them in technologies. But developing more inclusive technology takes time, effort and money, and often the decision to invest them is market-driven. (In response to several queries, only a Google spokesperson responded in time for publication, saying, in part, “We've been working on the challenge of accurately recognizing variations of speech for several years and will continue to do so.”)

Safiya Noble, an associate professor of information studies at the University of California, Los Angeles, admits that it's a tricky challenge. “Language is contextual,” says Noble, who was not involved in the study. “But that doesn't mean that companies shouldn't strive to decrease bias and disparities.” To do this, they need the input of humanists and social scientists who understand how language actually works.

From the tech side, feeding more diverse training data into the programs could close this gap, Koenecke says. Noble adds that tech companies should also test their products more widely and have more diverse workforces so people from different backgrounds and perspectives can directly influence the design of speech technologies. Koenecke suggests that automated speech-recognition companies use the PNAS study as a preliminary benchmark and keep using it to assess their systems over time.
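To make the idea of a recurring benchmark concrete, here is a minimal sketch of how transcription errors could be compared across speaker groups. It assumes word error rate (word-level edit distance divided by the number of reference words), a common speech-recognition metric; this article does not say which metric a company should use. The function names, group labels and sample sentences are hypothetical and are not taken from the study.

```python
from collections import defaultdict


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def mean_wer_by_group(samples):
    """Average word error rate per group.

    `samples` is an iterable of (group, reference, hypothesis) tuples, where
    `reference` is the human transcript and `hypothesis` is the machine output.
    """
    rates = defaultdict(list)
    for group, reference, hypothesis in samples:
        rates[group].append(word_error_rate(reference, hypothesis))
    return {group: sum(values) / len(values) for group, values in rates.items()}


# Hypothetical illustration data, not results from any real system.
samples = [
    ("group_a", "turn on the kitchen lights", "turn on the kitchen lights"),
    ("group_b", "turn on the kitchen lights", "turn on the chicken light"),
]
print(mean_wer_by_group(samples))
```

Running the same comparison on a fixed set of recordings after each model update would show whether the gap between groups is narrowing or widening over time.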

In the meantime, many of us will continue to struggle between identity and being understood when interacting with Alexa, Cortana or Siri. But Lawrence chooses identity every time: “I'm not switching,” she says. “I'm not doing it.”