Visitors to YouTube, which now boasts the Internet's second-largest search engine, have uploaded hundreds of millions of videos since its launch in early 2005. For most people YouTube, which Google bought for $1.65 billion in late 2006, is a valuable outlet for sharing personal videos, catching up on college lectures, consulting "how-to" clips and absorbing pop-culture nuggets like "Weird Al" Yankovic's parody of Lady Gaga. Until recently, however, the tens of millions of deaf and hearing-impaired people in the U.S. alone could not take full advantage of YouTube because they were getting only half of the experience. Google and YouTube engineers are working to fix this by improving software that can automatically add captions to all videos, although this has been a difficult process.

Google's mission is to organize the world's information, and a lot of that information on the Web is spoken rather than written, says the company's research scientist Mike Cohen, who joined Google in 2004 to head up speech technology development. Television, which introduced closed-captioning in the early 1970s and made it more widely available throughout the 1980s, in many ways has had an edge over the Web in serving the needs of the deaf, he adds.

Throughout much of the deaf community, "there was a feeling that after spending years winning the legal battles to have television programming captioned, all of a sudden the world had moved to YouTube," Cohen says. "We wanted to re-win that battle for them in a way that's scalable; it had to be done with technology rather than using humans to input captions with each video."

Reading along
Google introduced the ability to manually add captions to videos on its Google Video site in 2006 and in 2008 added captioning to YouTube. Google introduced machine-generated automatic captions to YouTube in November 2009 and has since sought to improve the technology with the help of speech-recognition modeling software and lots of data. Thus far, more than 60 million videos have been auto-captioned, according to Google.

The company's speech-recognition model has acoustic, lexicon and language components. The acoustic portion is a statistical model of the basic sounds made in spoken language (all of the vowels and consonants, for example). This is a large and complex model because those sounds often vary based on context (that is, where a speaker is raised and the dialect spoken), Cohen says.

The lexicon is basically a list of words in a given language and data about how they are pronounced (consider the two vowel sounds that are acceptable in pronouncing the "e" in "economics," for example). "For something like voice search we have a vocabulary of about one million words with the right pronunciations for those words and the variations in pronunciation," Cohen says.
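
As a rough illustration of the lexicon idea, a pronunciation dictionary can be sketched as a mapping from each word to one or more phoneme sequences. Everything in the snippet below is an assumption made for illustration: the ARPAbet-style phoneme symbols, the three sample entries and the helper names are hypothetical and are not Google's data or code.

```python
# Toy pronunciation lexicon: each word maps to a list of acceptable
# pronunciations, and each pronunciation is a list of phoneme symbols.
# The entries and ARPAbet-style symbols are illustrative assumptions.
LEXICON = {
    "economics": [
        ["EH", "K", "AH", "N", "AA", "M", "IH", "K", "S"],  # "eh-conomics"
        ["IY", "K", "AH", "N", "AA", "M", "IH", "K", "S"],  # "ee-conomics"
    ],
    "go": [["G", "OW"]],
    "to": [["T", "UW"], ["T", "AH"]],
}


def pronunciations(word):
    """Return every known pronunciation of a word (empty list if unknown)."""
    return LEXICON.get(word.lower(), [])


def words_matching(phonemes):
    """Return the words that list this exact phoneme sequence as a variant."""
    target = list(phonemes)
    return sorted(word for word, variants in LEXICON.items() if target in variants)


for variant in pronunciations("economics"):
    print(" ".join(variant))
```

A word missing from the dictionary simply returns no pronunciations here, which loosely mirrors the problem of specialized vocabulary that a recognizer has never seen.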

The language component of Google's speech-recognition model is a statistical model of all of the phrases and sentences that might be used within a language. This helps the auto-captioning function analyze how different words are often grouped together (the word "go," for example, is often followed by the word "to") and predict probable pairings based on that information.
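
The word-pairing idea can be sketched with a simple bigram model, which counts how often one word follows another and turns those counts into probabilities. This is a minimal sketch under stated assumptions, not Google's implementation: the three-sentence corpus and the function names are hypothetical, and a production language model would be trained on vastly more text and would consider more than adjacent pairs.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count, for every word, how often each other word directly follows it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_probability(counts, prev, nxt):
    """Estimate P(nxt | prev) from raw counts (0.0 if prev was never seen)."""
    followers = counts.get(prev)
    if not followers:
        return 0.0
    return followers[nxt] / sum(followers.values())


def most_likely_next(counts, prev):
    """Return the word most often seen after prev, or None if prev is unseen."""
    followers = counts.get(prev)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical toy corpus, used only to make the example runnable.
corpus = [
    "i want to go to the store",
    "we go to school",
    "they go home",
]
model = train_bigrams(corpus)
print(most_likely_next(model, "go"))
print(next_word_probability(model, "go", "to"))
```

In this toy corpus "go" is followed by "to" in two of its three appearances, so the model ranks "to" as the most probable next word, which is the kind of evidence a recognizer can use to choose between similar-sounding candidates.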

Much of the speech-recognition technology is tuned for the English language, although the company plans to expand auto-captioning to additional languages. For now, YouTube serves its global audience by translating auto-captions into more than 50 languages.

But does it work?
Auto-captioning is an easy sell to the deaf community because it affords them access to more of YouTube. Yet, this feature is often frustrating for deaf users, who find little use for video on the Web if the captions are not accurate. "I love the idea of auto-captioning because it allows me to understand many of YouTube's clips that I [otherwise] would not have," says Arielle Schacter, a 17-year-old junior at The Chapin School in New York City. Schacter, who is hard of hearing, adds, "The reality, however, is that the auto-captioning is often wrong. Instead of being able to read the actual dialogue, I am forced to view nonsensical statements or letters/numbers."

Auto-captions can become jumbled for several reasons, in particular source separation. The software must distinguish different waveforms in an audio stream to find the dialogue that needs to be captioned, as opposed to background music or conversation. This is particularly difficult because many videos uploaded to YouTube have poor audio quality and a lot of background noise, says Michiel Bacchiani, a senior Google staff research scientist specializing in speech recognition. "This is what YouTube is working to improve," he adds.

Auto-captioning also has difficulty transcribing language with very specialized words, such as those used during an academic lecture, Cohen says, adding, "These words aren't part of the common vocabulary, but if they're missed, you miss a lot of the meaning of the lecture."

Learning curve
Google claims that the most recent version of its auto-captioning software has reduced error rates by 20 percent. Indeed, an early version of the software could not recognize the word "YouTube" when it was used in videos, says Ken Harrenstien, the technical lead on the YouTube captioning project. Harrenstien, who is deaf, is the principal engineer behind the infrastructure that serves, manages and displays captions and a primary motivating force for the company's captioning projects.

Harrenstien recounts that most of the team working on the captioning project was "extremely concerned" about the quality of the first auto-captions. "I kept telling them over and over and over that, as one of the potential beneficiaries, I would be ecstatic to see even the most inaccurate captions generated by our algorithms," he says. "Most people don't realize that TV captioning for live events [such as sports] is generated by humans but can still often be atrocious to the point of illegibility. Still, if you know the context and have a good grasp of puns and homonyms, you have a shot at figuring out what's going on—and it's a lot better than nothing."

Despite the difficulty generating highly accurate auto-captions, Harrenstien says he was confident from the beginning that YouTube's automatic speech-recognition algorithms would improve over time and that the more auto-captions were used on the site, the more likely the company's engineers would be given an opportunity to improve the technology. "It works as well as we can make it, and I love it for that reason," he adds. "It is not perfect, does not pretend to be perfect, and may never be perfect, but it's a stake in the cliff we're continuing to climb."

The best way to improve auto-captioning accuracy in a way that scales to the millions of videos on YouTube is to feed more data to a larger, richer model of spoken language, essentially training the YouTube software to better interpret spoken words and place them in context, Cohen says.

In the near term there are other ways to improve caption quality. People posting to YouTube can download auto-captions added to their videos, correct any errors and then re-upload the captions to YouTube. Or they can upload their videos with captions already in place, Harrenstien says, noting one clear incentive—accurately captioned videos get "many, many more views globally."
