August 19, 2008

If You Use the Web, You May Have Already Been Enlisted as a Human Scanner

Those anti-bot security forms that slow you down when you're entering information just might serve a larger purpose

By Adam Hadhazy   

 

FOR YOUR EYES ONLY: Machine transcription devices cannot read some of the smudgy letters written in old books, but people can. The reCAPTCHA Internet security software system lets humans decipher these tricky words.


You're just about ready to buy a pair of tickets on Ticketmaster, but before you can take the next step, an annoying box with wavy letters and numbers shows up on your screen. You dutifully enter in what you see—and what a bot presumably can't—in the name of security.

But what you may not know is that you also have helped archivists decipher distorted characters in old books and newspapers so that they can be posted on the Web.

You might think that computer scientists would have figured out a way to get computers to decipher those characters. But they haven't, so instead they've figured out a way to harness all that effort you're making to protect your security. "When you're reading those squiggly characters, you are doing something that computers cannot," says Luis von Ahn, a computer scientist at Carnegie Mellon University (C.M.U.) in Pittsburgh.

Von Ahn and colleagues reported last week in the journal Science that Web users have transcribed the equivalent of 160 books a day (more than 440 million words) in the year since researchers kicked off the program. The initiative is similar to "distributed computing" schemes, which take advantage of unused personal computer processing power: SETI@home, for example, sifts through signals received from space for those that might be generated by extraterrestrial intelligence, and other projects work out how proteins fold. The difference with this system is that people, not processors, do the calculations.

"We are getting people to help us digitize books at the same time they are authenticating themselves as humans," von Ahn says. "Every time people are typing these [answers] out, they are actually taking old books or newspapers and helping to transcribe them."

Other large digitization projects, such as the Google Books Project and the Internet Archive, rely on optical character recognition (OCR) software. Basically, computers take a digital image of a book or newspaper page, then try to discern the individual letters, von Ahn says. But he and other C.M.U. researchers estimate that these programs misinterpret or fail to read up to one out of every five words on weathered, yellowed paper or on pages with faded or smeared ink. Such electronically illegible words and texts must then be manually transcribed by human workers at a relatively high cost, he says.

The method devised by von Ahn's team is a twist on the Web site tests known as CAPTCHAs (Completely Automated Public Turing test to tell Computers and Humans Apart), which have been in use since 2000. The new version draws its letters from old, weathered books and newspapers that computerized transcription programs cannot recognize. Much of the raw "fuel" comes courtesy of the Internet Archive project, which transmits words that its OCR software cannot recognize or that do not appear in the dictionary.
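The selection rule described above, passing along words that the OCR software cannot recognize or that are missing from the dictionary, can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration, not the Internet Archive's or reCAPTCHA's actual code: the OcrWord record, the 0-to-1 confidence score, the 0.8 cutoff and the tiny dictionary are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass
class OcrWord:
    text: str          # the OCR engine's best guess for the scanned word
    confidence: float  # hypothetical 0.0-1.0 score; real engines report this differently


def needs_human(word: OcrWord, dictionary: Set[str], min_confidence: float = 0.8) -> bool:
    """Return True when a scanned word should be passed along to human readers."""
    unrecognized = word.confidence < min_confidence          # the engine is unsure of its reading
    not_in_dictionary = word.text.lower() not in dictionary  # the reading is not a known word
    return unrecognized or not_in_dictionary


def select_for_humans(words: Iterable[OcrWord], dictionary: Set[str],
                      min_confidence: float = 0.8) -> List[OcrWord]:
    """Keep only the words that fail either check."""
    return [w for w in words if needs_human(w, dictionary, min_confidence)]


# Illustrative data only: a three-word "page" and a three-word dictionary.
dictionary = {"the", "morning", "paper"}
page = [OcrWord("the", 0.99), OcrWord("rnorning", 0.91), OcrWord("paper", 0.42)]
flagged = select_for_humans(page, dictionary)
print([w.text for w in flagged])  # ['rnorning', 'paper']
```

In this toy run, "rnorning" is flagged because it is not in the dictionary even though the engine was confident, and "paper" is flagged because the engine's confidence fell below the assumed cutoff.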

About 40,000 Web sites now use the service, called reCAPTCHA, which the project's site offers for free. Facebook was one of its first major patrons.

Von Ahn estimates that at reCAPTCHA's current rate of transcription (about four million words a day missed by OCR systems), the program does in a single day what 1,500 professional transcribers would do in a week. This data is stored on hard drives at C.M.U. and then sent back to the organization that requested the transcription. (The New York Times, for example, has enlisted reCAPTCHA to digitize the newspaper's archives dating back to 1851.)



© 1996-2009 Scientific American Inc. All Rights Reserved. Reproduction in whole or in part without permission is prohibited.