December 9, 2020

Artificial Intelligence Is Now Shockingly Good at Sounding Human

Synthetic voices have become ubiquitous. They feed us directions in the morning, shepherd us through phone calls by day and broadcast the news on smart speakers at night. And as the technology used to make them improves, these voices are becoming more and more human-sounding. This is the final frontier in synthetic speech: replicating not just what we say but how we say it.

Rupal Patel heads a research group at Northeastern University that studies speech prosody—the changes in pitch, loudness and duration that we use to convey intent and emotion through voice. “Sometimes people think of it as the icing on the cake,” she explains. “You have the message, and now it’s how you modulate that message, but I really think it’s the scaffolding that gives meaning to the message itself.”

Patel says she grew interested in prosody after finding it was the only element of vocal communication that seemed to be available to people with some kinds of severe speech disorders. These patients were able to make expressive sounds even if they could not speak clearly. In 2014 Patel founded a company, VocaliD, to build custom synthetic voices for nonspeaking individuals. It has since expanded to serve commercial brands and influencers.

Synthetic speech has come a long way over the years. At age nine, Siri is the oldest virtual assistant—but in the world of speaking machines, she’s a baby. People have been trying to synthesize speech since at least the 18th century, when an Austro-Hungarian inventor built a crude replica of the human vocal tract that could articulate entire phrases (albeit in a monotone).

Current machine-learning techniques can model human speech, complete with awkward pauses and lip smacks. Still, training on thousands of samples per second is prohibitively expensive for most real-world systems. Researchers, including those at VocaliD, are continually implementing newer and more efficient methods.

But while the remaining gaps between human and synthetic speech are steadily closing, truly lifelike prosody continues to elude even the most sophisticated systems. Maybe what’s still missing requires machines not only to mimic humans but also to feel like us.

By Meghan McDonough
