Auto-captions can become jumbled for several reasons, in particular source separation. The software must distinguish different waveforms in an audio stream to find the dialogue that needs to be captioned, as opposed to background music or conversation. This is particularly difficult because many videos uploaded to YouTube have poor audio quality and a lot of background noise, says Michiel Bacchiani, a senior Google staff research scientist specializing in speech recognition. "This is what YouTube is working to improve," he adds.
Auto-captioning also has difficulty transcribing language with very specialized words, such as those used during an academic lecture, Cohen says, adding, "These words aren't part of the common vocabulary, but if they're missed, you miss a lot of the meaning of the lecture."
Google claims that the most recent version of its auto-captioning software has reduced error rates by 20 percent. Indeed, an early version of the software could not recognize the word "YouTube" when it was used in videos, says Ken Harrenstien, the technical lead on the YouTube captioning project. Harrenstien, who is deaf, is the principal engineer behind the infrastructure that serves, manages and displays captions and a primary motivating force for the company's captioning projects.
Harrenstien recounts that most of the team working on the captioning project was "extremely concerned" about the quality of the first auto-captions. "I kept telling them over and over and over that, as one of the potential beneficiaries, I would be ecstatic to see even the most inaccurate captions generated by our algorithms," he says. "Most people don't realize that TV captioning for live events [such as sports] is generated by humans but can still often be atrocious to the point of illegibility. Still, if you know the context and have a good grasp of puns and homonyms, you have a shot at figuring out what's going on—and it's a lot better than nothing."
Despite the difficulty of generating highly accurate auto-captions, Harrenstien says he was confident from the beginning that YouTube's automatic speech-recognition algorithms would improve over time and that the more auto-captions were used on the site, the more likely the company's engineers would be given an opportunity to improve the technology. "It works as well as we can make it, and I love it for that reason," he adds. "It is not perfect, does not pretend to be perfect, and may never be perfect, but it's a stake in the cliff we're continuing to climb."
The best way to improve auto-captioning accuracy in a way that scales to the millions of videos on YouTube is to feed more data to a larger, richer model of spoken language, essentially training the YouTube software to better interpret spoken words and place them in context, Cohen says.
In the near term there are other ways to improve caption quality. People posting to YouTube can download auto-captions added to their videos, correct any errors and then re-upload the captions to YouTube. Or they can upload their videos with captions already in place, Harrenstien says, noting one clear incentive—accurately captioned videos get "many, many more views globally."