
A typical reCAPTCHA

You've done it so many times, at so many sites across the Internet, that chances are
you don't even think about it anymore: deciphering and typing in a "CAPTCHA," one of those squiggly, mucked-up words
presented each time you buy tickets online, write a blog comment, or join a
social network. Their purpose is clear: they tell Web sites that you are a
person and not a computer, theoretically cutting down on spam. More perceptive Web users may have noticed that sometimes the garbled strings appear in pairs, with one looking more like it's been scanned out of a library book or old
newspaper, perhaps with some sloppy underlining or stray pen marks. The paired
version is a variant known as "reCAPTCHA," and for
two years it has been performing double duty, both authenticating you and
helping to digitize old printed material at the same time. Far from just
wasting your time, it has now helped digitize almost all of the New York Times archives.

Both CAPTCHA (which stands for Completely Automated Public Turing Test to Tell
Computers and Humans Apart) and reCAPTCHA are the invention of Luis von Ahn, a Carnegie Mellon
computer scientist and MacArthur "genius grant" recipient. "A couple hundred million
CAPTCHAs are typed daily around the world," von Ahn tells NEWSWEEK. "The first
time I did the calculations, I felt quite proud. And then I felt bad because
people really find these annoying." They're also wasteful. It takes about 10 seconds
to type a CAPTCHA (more, obviously, if you err and have to start over), meaning
that some 500,000 human hours per day are spent typing them in. As a
point of comparison, according to von Ahn, the Empire State Building took 7 million
human hours to build. "Life is only like 700,000 hours," he says. "It's almost
the equivalent of a life. We thought, is there any way we can use this human
effort in a way that's good for humanity?"
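
The arithmetic behind those figures can be checked in a few lines. This is only a rough sketch: it reads "a couple hundred million" as exactly 200 million, which is an assumption, and takes the other numbers as quoted above.

```python
# Figures quoted in the article. Reading "a couple hundred million" as
# exactly 200 million is an assumption made for this sketch.
CAPTCHAS_PER_DAY = 200_000_000
SECONDS_PER_CAPTCHA = 10
EMPIRE_STATE_HOURS = 7_000_000   # human hours to build, per von Ahn
LIFETIME_HOURS = 700_000         # von Ahn's rough figure for one life


def daily_human_hours(captchas_per_day, seconds_each):
    """Convert a daily CAPTCHA count into human hours spent typing."""
    return captchas_per_day * seconds_each / 3600


hours = daily_human_hours(CAPTCHAS_PER_DAY, SECONDS_PER_CAPTCHA)
print(f"Human hours per day: {hours:,.0f}")
print(f"Lifetimes per day: {hours / LIFETIME_HOURS:.2f}")
print(f"Days per Empire State Building: {EMPIRE_STATE_HOURS / hours:.1f}")
```

On these assumptions the total comes to roughly 556,000 hours a day, in line with the "some 500,000" quoted, or a little under one lifetime every day.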

Turns out, there is. Recognizing distorted words is one of the (dwindling number of) things
that the human brain can still do better than computers. In order to make old books,
newspapers, and other texts searchable, pages are scanned and fed into optical
character-recognition software. Because ink and paper degrade over time, some
words remain inscrutable. The reCAPTCHA system presents Web users with two
words: one word that computers can't read, and one that they can. So long as
you type the known word in correctly, and a few other people agree with you on
the unknown word, you have helped digitize an archival page. And, von Ahn says,
typing in two words instead of one doesn't cost you a significant amount of extra
time.
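
The two-word check described above can be sketched roughly as follows. This is an illustration, not reCAPTCHA's actual code: the names `UnknownWord` and `submit`, the threshold of three matching answers (the article says only "a few"), and the case-insensitive comparison are all assumptions.

```python
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical threshold: the article says only "a few other people".
AGREEMENT_THRESHOLD = 3


@dataclass
class UnknownWord:
    """A scanned word the OCR software could not read, plus guesses so far."""
    image_id: str
    votes: Counter = field(default_factory=Counter)

    def resolved(self):
        """Return the agreed transcription, or None if there is no consensus."""
        if not self.votes:
            return None
        guess, count = self.votes.most_common(1)[0]
        return guess if count >= AGREEMENT_THRESHOLD else None


def normalize(text):
    """Assumed comparison rule: ignore case and surrounding whitespace."""
    return text.strip().lower()


def submit(known_answer, unknown, typed_known, typed_unknown):
    """Check the control word; record the other guess only if the user passed."""
    passed = normalize(typed_known) == normalize(known_answer)
    if passed:
        unknown.votes[normalize(typed_unknown)] += 1
    return passed


# Three users get the control word "upon" right and agree on the unknown word.
word = UnknownWord("scan-0001")
for guess in ["morning", "Morning", "morning "]:
    submit("upon", word, "upon", guess)

# A fourth user fails the control word, so the guess is ignored.
submit("upon", word, "xyz", "mourning")

print(word.resolved())
```

The point of the design is that the known word does the authenticating, so a guess at the unknown word is counted only when it comes from someone who has already shown they are human.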

Von Ahn is tough to pin down on a number of details: he won't say how much the New York Times pays for reCAPTCHA's
services, nor the precise amount of progress it has made in digitizing its 150 years
of pages. But he hinted in his recent talk
at the 2009 PopTech conference that the
project was on track to finish by the end of 2009 or slightly later.

reCAPTCHA, which is free for Web sites to implement, is being used by Facebook,
Craigslist, Twitter, and more than 100,000 other sites. In September, it was acquired
by Google, which has massive human proofreading needs in its Google Books
and Google News Archive
projects. At some 40 million deciphered words a day and approximately 100,000
words per book, the reCAPTCHA army could in theory chew through about 400 books a day, or well over 100,000 books per year.
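
As a quick check on that throughput, using only the two figures quoted above and assuming a 365-day year with every deciphered word counting toward a finished book:

```python
# Figures quoted in the article; the 365-day year is an assumption.
WORDS_PER_DAY = 40_000_000
WORDS_PER_BOOK = 100_000

books_per_day = WORDS_PER_DAY / WORDS_PER_BOOK
books_per_year = books_per_day * 365

print(f"{books_per_day:,.0f} books per day, {books_per_year:,.0f} books per year")
```

That works out to 400 books a day, or about 146,000 a year.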

It's been said that we shouldn't ask what's next in terms of what the Internet and
technology will be able to do but instead try to understand what we've already
got and figure out how to put it to good use. Von Ahn's efforts surely prove
that point. They also show that in some ways, we can help computers as much as
they help us.