You're just about ready to buy a pair of tickets on Ticketmaster, but before you can take the next step, an annoying box with wavy letters and numbers shows up on your screen. You dutifully enter in what you see—and what a bot presumably can't—in the name of security.
But what you may not know is that you also have helped archivists decipher distorted characters in old books and newspapers so that they can be posted on the Web.
You might think that computer scientists would have figured out a way to get computers to decipher those characters. But they haven't, so instead they've figured out a way to harness all that effort you're making to protect your security. "When you're reading those squiggly characters, you are doing something that computers cannot," says Luis von Ahn, a computer scientist at Carnegie Mellon University (C.M.U.) in Pittsburgh.
Von Ahn and colleagues reported last week in the journal Science that Web users have transcribed the equivalent of 160 books a day, more than 440 million words in all, in the year since researchers kicked off the program. The initiative is similar to "distributed computing" schemes, which take advantage of unused personal computer processing power to sift through signals received from space for those that might be generated by extraterrestrial intelligence (as SETI@home does) or to figure out how proteins fold. The difference with this system is that people, not processors, do the work.
"We are getting people to help us digitize books at the same time they are authenticating themselves as humans," von Ahn says. "Every time people are typing these [answers] out, they are actually taking old books or newspapers and helping to transcribe them."
Other large digitization projects, such as the Google Books Project and the Internet Archive, rely on optical character recognition (OCR) software. Basically, computers take a digital image of a book or newspaper page, then try to discern the individual letters, von Ahn says. But he and other C.M.U. researchers estimate that these programs misinterpret or fail to read up to one out of every five words on weathered, yellowed paper or on pages with faded or smeared ink. Such electronically illegible words and texts must then be manually transcribed by human workers at a relatively high cost, he says.
Von Ahn's team's method builds on the Web site tests known as CAPTCHAs (Completely Automated Public Turing test to tell Computers and Humans Apart), which have been in use since 2000. The twist is to show users words from old, weathered books and newspapers that computerized transcription programs cannot recognize. Much of the raw "fuel" comes courtesy of the Internet Archive project, which sends along words that its OCR software cannot recognize or that do not appear in the dictionary.
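The selection rule described above (pass along a word if OCR could not read it or if its reading is not in the dictionary) can be sketched in a few lines of Python. This is only an illustration, not the researchers' actual code: the `OcrWord` record, the `image_id` field, and the function names are hypothetical, and the dictionary is assumed to be a plain set of lowercase words.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass
class OcrWord:
    """Hypothetical record for one scanned word (not from the article)."""
    image_id: str          # assumed identifier for the cropped word image
    text: Optional[str]    # OCR's reading, or None if OCR could not read it


def needs_human(word: OcrWord, dictionary: Set[str]) -> bool:
    """Return True if the word should be shown to a person."""
    if word.text is None:
        # OCR could not recognize the word at all.
        return True
    # OCR produced a reading, but it is not a known dictionary word.
    return word.text.lower() not in dictionary


def select_for_captcha(words: Iterable[OcrWord], dictionary: Set[str]) -> List[str]:
    """Collect the image ids of words that OCR could not handle."""
    return [w.image_id for w in words if needs_human(w, dictionary)]
```

For example, given a scan in which OCR read "the" correctly, gave up on a second word, and misread a third as "tbe", only the last two would be queued for human readers.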
Von Ahn estimates that at reCAPTCHA's current rate of transcription (about four million words a day missed by OCR systems), the program does in a single day what 1,500 professional transcribers would do in a week. The transcribed words are stored on hard drives at C.M.U. and then sent back to the organization that requested the transcription. (The New York Times, for example, has enlisted reCAPTCHA to digitize the newspaper's archives dating back to 1851.)
Von Ahn acknowledges that the overall cost for reCAPTCHA is still a bit higher than just using OCR for more recently written, more easily scanned texts. He would not say exactly how much, citing nondisclosure agreements with clients using the software.
When the researchers compared how reCAPTCHA and OCR transcribed five Times articles, reCAPTCHA did a significantly better job—99.1 percent accuracy—than OCR of the sort that Google uses for its book project, which came in at 83.5 percent. (Google declined to comment for this story.)
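The article does not say exactly how accuracy was scored. One simple reading is the share of words transcribed correctly, which can be sketched as follows, assuming the reference text and the transcript have already been lined up word for word. The function name and that alignment assumption are illustrative, not taken from the study.

```python
from typing import Sequence


def word_accuracy(reference: Sequence[str], transcript: Sequence[str]) -> float:
    """Fraction of words in the transcript that match the reference.

    Assumes both sequences are already aligned word for word; a real
    comparison would need an alignment step to handle dropped or
    inserted words.
    """
    if not reference:
        raise ValueError("reference must contain at least one word")
    if len(reference) != len(transcript):
        raise ValueError("sequences must be aligned to the same length")
    correct = sum(ref == got for ref, got in zip(reference, transcript))
    return correct / len(reference)
```

Under this measure, a transcript that gets three of four words right scores 0.75.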
But as is the way with most technology, today's innovation is tomorrow's VHS tape. Eventually computers will be able to decipher reCAPTCHAs, too. "We'll get a few good years out of reCAPTCHAs," says co-author Manuel Blum, a professor of computer science at Carnegie Mellon and key developer of some of the first CAPTCHAs.
OCR will continue to improve as well, Blum says, along with so-called machine learning in general.
Either way, von Ahn says, some 100 million books were published prior to the dawn of the digital era, and that "makes for a lot of words."