Visitors to YouTube, which now boasts the Internet's second-largest search engine, have uploaded hundreds of millions of videos since its launch in early 2005. For most people YouTube, which Google bought for $1.65 billion in late 2006, is a valuable outlet for sharing personal videos, catching up on college lectures, consulting "how-to" clips and absorbing pop-culture nuggets like "Weird Al" Yankovic's parody of Lady Gaga. Until recently, however, the tens of millions of deaf and hearing-impaired people in the U.S. alone could not take full advantage of YouTube because they were getting only half of the experience. Google and YouTube engineers are working to fix this by improving software that can automatically add captions to all videos, although this has been a difficult process.
Google's mission is to organize the world's information, and a lot of that information on the Web is spoken rather than written, says the company's research scientist Mike Cohen, who joined Google in 2004 to head up speech technology development. Television, which introduced closed-captioning in the early 1970s and made it more widely available throughout the 1980s, in many ways has had an edge over the Web in serving the needs of the deaf, he adds.
Throughout much of the deaf community, "there was a feeling that after spending years winning the legal battles to have television programming captioned, all of a sudden the world had moved to YouTube," Cohen says. "We wanted to re-win that battle for them in a way that's scalable; it had to be done with technology rather than using humans to input captions with each video."
Google introduced the ability to manually add captions to videos on its Google Video site in 2006 and in 2008 added captioning to YouTube. Google introduced machine-generated automatic captions to YouTube in November 2009 and has since sought to improve the technology with the help of speech-recognition modeling software and lots of data. Thus far, more than 60 million videos have been auto-captioned, according to Google.
The company's speech-recognition model has acoustic, lexicon and language components. The acoustic portion is a statistical model of the basic sounds made in spoken language (all of the vowels and consonants, for example). This is a large and complex model because those sounds often vary based on context (that is, where a speaker was raised and the dialect he or she speaks), Cohen says.
The lexicon is basically a list of words in a given language and data about how they are pronounced (consider the two vowel sounds that are acceptable in pronouncing the "e" in "economics," for example). "For something like voice search we have a vocabulary of about one million words with the right pronunciations for those words and the variations in pronunciation," Cohen says.
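To make the idea of a lexicon concrete, here is a minimal sketch in Python of a word list that stores more than one acceptable pronunciation per word. It is an illustration only, not Google's implementation: the phoneme symbols loosely follow the ARPAbet convention used by public pronouncing dictionaries, and the entries and the function name pronunciations are assumptions made for this example.

```python
# Toy pronunciation lexicon: each word maps to a list of acceptable
# pronunciations, written as space-separated phoneme symbols.
# The entries are illustrative assumptions, not data from Google.
TOY_LEXICON = {
    "economics": [
        "EH K AH N AA M IH K S",  # first vowel as in "echo"
        "IY K AH N AA M IH K S",  # first vowel as in "equal"
    ],
    "go": ["G OW"],
    "to": ["T UW", "T AH"],  # full form and reduced form
}


def pronunciations(word):
    """Return every listed pronunciation of a word (empty list if unknown)."""
    return TOY_LEXICON.get(word.lower(), [])


if __name__ == "__main__":
    for variant in pronunciations("economics"):
        print(variant)
```

A real lexicon of the size Cohen describes would hold on the order of a million such entries, but the structure is the same: one word, one or more phoneme sequences.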
The language component of Google's speech-recognition model is a statistical model of all of the phrases and sentences that might be used within a language. This helps the auto-captioning function analyze how different words are often grouped together (the word "go," for example, is often followed by the word "to") and predict probable pairings based on that information.
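The word-pairing idea can be sketched with a simple bigram model, which counts how often one word follows another and turns those counts into probabilities. This is a toy illustration under stated assumptions, not a description of Google's language model: the four-sentence corpus and the function names are hypothetical, and production systems train on vastly more data with more sophisticated statistics.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count, for each word, which words follow it and how often."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_probability(counts, prev, nxt):
    """Estimate P(nxt | prev) from raw counts (0.0 if prev was never seen)."""
    followers = counts.get(prev, Counter())
    total = sum(followers.values())
    return followers[nxt] / total if total else 0.0


def most_likely_next(counts, prev):
    """Return the word most often seen after prev, or None if unseen."""
    followers = counts.get(prev)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical toy corpus, invented for this example.
TOY_CORPUS = [
    "we go to the store",
    "they go to school",
    "go to the park",
    "you go home now",
]

counts = train_bigrams(TOY_CORPUS)

if __name__ == "__main__":
    print(most_likely_next(counts, "go"))
    print(next_word_probability(counts, "go", "to"))
```

In this toy corpus "go" is followed by "to" in three of its four appearances, so the model prefers "to" when the acoustic evidence is ambiguous. That preference is what lets a recognizer choose between similar-sounding candidates such as "to," "two" and "too."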
Much of the speech-recognition technology is tuned for the English language, although the company plans to expand auto-captioning to additional languages. For now, YouTube serves its global audience by translating auto-captions into more than 50 languages.
But does it work?
Auto-captioning is an easy sell to the deaf community because it affords them access to more of YouTube. Yet the feature is often frustrating for deaf users, who find little use for video on the Web if the captions are not accurate. "I love the idea of auto-captioning because it allows me to understand many of YouTube's clips that I [otherwise] would not have," says Arielle Schacter, a 17-year-old junior at The Chapin School in New York City. Schacter, who is hard of hearing, adds, "The reality, however, is that the auto-captioning is often wrong. Instead of being able to read the actual dialogue, I am forced to view nonsensical statements or letters/numbers."