New AI Tech Can Mimic Any Voice

Emerging technologies in speech generation raise ethics and security concerns

Even the most natural-sounding computerized voices—whether it’s Apple’s Siri or Amazon’s Alexa—still sound like, well, computers. Montreal-based start-up Lyrebird is looking to change that with an artificially intelligent system that learns to mimic a person’s voice by analyzing speech recordings and the corresponding text transcripts as well as identifying the relationships between them. Introduced last week, Lyrebird’s speech synthesis can generate thousands of sentences per second—significantly faster than existing methods—and mimic just about any voice, an advancement that raises ethical questions about how the technology might be used and misused.

The ability to generate natural-sounding speech has long been a core challenge for computer programs that transform text into spoken words. Artificial intelligence (AI) personal assistants such as Siri, Alexa, Microsoft’s Cortana and the Google Assistant all use text-to-speech software to create a more convenient interface with their users. Those systems work by cobbling together words and phrases from prerecorded files of one particular voice. Switching to a different voice—such as having Alexa sound like a man—requires a new audio file containing every possible word the device might need to communicate with users.
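
The stitching approach described above can be illustrated with a toy sketch. It is not how Siri or Alexa are actually implemented. The function `synthesize`, the voice banks, and the fake sample values are all hypothetical, chosen only to show why a new voice needs its own complete set of recordings.

```python
# Toy illustration of stitching prerecorded clips into speech.
# All names and data here are hypothetical; real systems use far
# smaller units than whole words and smooth the joins between clips.
from typing import Dict, List

Clip = List[float]  # a "recording", faked as a short list of audio samples


def synthesize(text: str, voice_bank: Dict[str, Clip]) -> Clip:
    """Join one prerecorded clip per word, in order."""
    audio: Clip = []
    for word in text.lower().split():
        if word not in voice_bank:
            # A voice can only say what has been recorded in that voice.
            raise KeyError(f"no recording of {word!r} in this voice")
        audio.extend(voice_bank[word])
    return audio


# Voice A has both words recorded; voice B has only one.
voice_a = {"hello": [0.1, 0.2], "world": [0.3, 0.4, 0.5]}
voice_b = {"hello": [0.9, 0.8]}

speech = synthesize("Hello world", voice_a)
```

Calling `synthesize("Hello world", voice_b)` raises a `KeyError`, because the second voice has no recording of "world".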

Lyrebird’s system can learn the pronunciations of characters, phonemes and words in any voice by listening to hours of spoken audio. From there it can extrapolate to generate completely new sentences and even add different intonations and emotions. Key to Lyrebird’s approach are artificial neural networks—which use algorithms designed to help them function like a human brain—that rely on deep-learning techniques to transform bits of sound into speech. A neural network takes in data and learns patterns by strengthening connections between layered neuronlike units.
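
The idea of learning by strengthening connections can be sketched with a single neuronlike unit. This is a classic perceptron update rule, swapped in purely for illustration. It is not Lyrebird's model, which is a deep network with many layers, and the names `train_unit` and `predict` are hypothetical.

```python
# A single neuronlike unit that learns the logical OR pattern.
# Illustrative only: a perceptron, not a deep network and not Lyrebird's model.

def predict(weights, bias, inputs):
    """Fire (return 1) when the weighted sum of the inputs is positive."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if activation > 0 else 0


def train_unit(samples, epochs=20, learning_rate=0.1):
    """Adjust connection weights whenever the unit's output is wrong."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            # Strengthen (or weaken) each connection in proportion to its input.
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


or_samples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, bias = train_unit(or_samples)
```

After training, `predict(weights, bias, inputs)` reproduces the target for each of the four examples. The connections from both inputs end up with positive weights.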

After learning how to generate speech, the system can then adapt to any voice based on only a one-minute sample of someone’s speech. “Different voices share a lot of information,” says Lyrebird co-founder Alexandre de Brébisson, a PhD student at the Montreal Institute for Learning Algorithms laboratory at the University of Montreal. “After having learned several speakers’ voices, learning a whole new speaker's voice is much faster. That’s why we don’t need so much data to learn a completely new voice. More data will still definitely help, yet one minute is enough to capture a lot of the voice ‘DNA.’”

Lyrebird showcased its system using the voices of U.S. political figures Donald Trump, Barack Obama and Hillary Clinton in a synthesized conversation about the start-up itself. The company plans to sell the system to developers for use in a wide range of applications, including personal AI assistants, audio book narration and speech synthesis for people with disabilities.

Last year Google-owned company DeepMind revealed its own speech-synthesis system, called WaveNet, which learns from listening to hours of raw audio to generate sound waves similar to a human voice. It then can read a text out loud with a humanlike voice. Both Lyrebird and WaveNet use deep learning, but the underlying models are different, de Brébisson says. “Lyrebird is significantly faster than WaveNet at generation time,” he says. “We can generate thousands of sentences in one second, which is crucial for real-time applications. Lyrebird also adds the possibility of copying a voice very fast and is language-agnostic.” Scientific American reached out to DeepMind but was told WaveNet team members were not available for comment.

Lyrebird’s speed comes with a trade-off, however. Timo Baumann, a researcher who works on speech processing at the Language Technologies Institute at Carnegie Mellon University and is not involved in the start-up, noted that Lyrebird’s generated voice carries a buzzing noise and a faint but noticeable robotic sheen. Moreover, it does not generate breathing or mouth-movement sounds, which are common in natural speech. “Sounds like lip smack and inbreathe are important in conversation. They actually carry meaning and are observable to the listener,” Baumann says. These flaws make it possible to distinguish the computer-generated speech from genuine speech, he adds. It will still be a few years before the technology can copy a voice convincingly in real time, he says.

Still, to untrained ears and unsuspecting minds, an AI-generated audio clip could seem genuine, creating ethical and security concerns about impersonation. Such a technology might also confuse and undermine voice-based verification systems. Another concern is that it could render unusable voice and video recordings used as evidence in court. A technology that can be used to quickly manipulate audio will even call into question the veracity of real-time video in live streams. And in an era of fake news it can only compound existing problems with identifying sources of information. “It will probably be still possible to find out when audio has been tampered with,” Baumann says, “but I’m not saying that everybody will check.”

Systems equipped with a humanlike voice may also pose less obvious but equally problematic risks. For example, users may trust these systems more than they should, giving out personal information or accepting purchasing advice from a device, treating it like a friend rather than a product that belongs to a company and serves its interests. “Compared to text, voice is just much more natural and intimate to us,” Baumann says.

Lyrebird acknowledges these concerns and essentially issues a warning in the brief “ethics” statement on the company’s Web site. Lyrebird cautions the public that the software could be used to manipulate audio recordings used as evidence in court or to assume someone else’s identity. “We hope that everyone will soon be aware that such technology exists and that copying the voice of someone else is possible,” according to the site.

Just as people have learned photographs cannot be fully trusted in the age of Photoshop, they may need to get used to the idea that speech can be faked. There is currently no way to prevent the technology from being used to make fraudulent audio, says Bruce Schneier, a security technologist and lecturer in public policy at the Kennedy School of Government at Harvard University. The risk of encountering a fake audio clip has now become “the new reality,” he says.
