Deciphering Old Texts, One Woozy, Curvy Word at a Time
By GUY GUGLIOTTA
March 28, 2011
In the old days, anybody interested in seeing a Mets game during a trip to New York would have to call the team, or write away, or wait to get to the city and visit the box office. No more. Now, all it takes is to find an online ticket distributor. Sign in, click “Mets,” pick the date and pay.
But before taking the money, the Web site might first present the reader with two sets of wavy, distorted letters and ask for a transcription. These things are called Captchas, and only humans can read them. Captchas ensure that robots do not hack secure Web sites.
What Web readers do not know, however, is that they have also been enlisted in a project to transform an old book, magazine, newspaper or pamphlet into an accurate, searchable and easily sortable computer text file.
One of the wavy words quite likely came from a digitized image of an old, musty text, and while the original page has already been scanned into an online database, the scanning programs made a lot of mistakes. Mets fans and other Web site users are correcting them. Buy a ticket to the ballgame, help preserve history.
The set of software tools that accomplishes this feat is called reCaptcha and was developed by a team of researchers led by Luis von Ahn, a computer scientist at Carnegie Mellon University.
Its pilot project was to clean up the digitized archive of The New York Times. Today it has become the principal method used by Google to authenticate text in Google Books, its vast project to digitize and disseminate rare and out-of-print texts on the Internet.
Digitization is normally a three-stage process: create a photographic image of the text, also known as a bitmap; encode the text in a compact, easily handled and searchable form using optical character recognition software, commonly called O.C.R.; and, finally, correct the mistakes.
Today’s technology makes the first two steps relatively straightforward. The third, however, can be extremely difficult. For vintage 19th-century texts in English, O.C.R. programs mess up or miss 10 percent to 30 percent of the words. Only humans can fix the errors. The standard method, called key and verify, uses two transcribers to type the text independently and compares the results. This is time-consuming and extremely expensive.
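The comparison at the heart of key and verify can be sketched in a few lines of Python. This is an illustration only: the function name and the sample words are hypothetical, and it assumes both typists produced the same number of words, which a real workflow could not take for granted.

```python
def key_and_verify(first, second):
    """Return the word positions where two independent transcriptions disagree.

    Assumption for this sketch: both transcriptions are already split into
    words and are the same length.
    """
    if len(first) != len(second):
        raise ValueError("transcriptions must cover the same words")
    return [i for i, (a, b) in enumerate(zip(first, second)) if a != b]


# Hypothetical sample: the second typist misread one word.
typist_a = "the quick brown fox".split()
typist_b = "the quick brovvn fox".split()

disagreements = key_and_verify(typist_a, typist_b)
print(disagreements)  # [2]
```

Every position in the returned list still needs a person to look at the page again, which is part of why the method is slow and costly.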
But in 2006, Dr. von Ahn’s team figured out a way around this obstacle. The ubiquitous Captchas, familiar to even the most casual Web user, were the perfect tools. Captchas, short for “completely automated public Turing test to tell computers and humans apart,” are impossible for machines to decipher, but easy for humans. (The test is named for the British computer pioneer Alan Turing.)
Dr. von Ahn’s group estimated that humans around the world decode at least 200 million Captchas per day, at 10 seconds per Captcha. This works out to about 550,000 hours per day, a lot of applied brainpower being spent on what Dr. von Ahn regards as a fundamentally mindless exercise.
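The estimate is simple arithmetic on the two figures the group quoted. A short Python check, with constant names chosen here for illustration:

```python
# Figures quoted by Dr. von Ahn's group.
CAPTCHAS_PER_DAY = 200_000_000
SECONDS_PER_CAPTCHA = 10
SECONDS_PER_HOUR = 3600

hours_per_day = CAPTCHAS_PER_DAY * SECONDS_PER_CAPTCHA / SECONDS_PER_HOUR
print(round(hours_per_day))  # 555556
```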
“So we asked, ‘Can we do something useful with this time?’ ” Dr. von Ahn recalled in a telephone interview. Instead of making Captchas out of random words printed in a woozy way, why not ask Web users to transcribe problem words from archival texts?
By Dr. von Ahn’s estimate, reCaptcha is being used by 70 percent to 90 percent of Web sites that have Captchas — including Ticketmaster, Facebook and local bank branches.
Google bought Dr. von Ahn’s start-up in 2009 (he will not say how much it paid) and put it to work on Google Books. He says “several million” words are being transcribed every day.
The Times, published since 1851, had already optically transcribed its archive when it contacted Dr. von Ahn. Robert Larson, the company’s vice president for search products, said the paper had “looked at various ways” to edit the text, “but Luis’s method was faster and cheaper.”
Page images, particularly those printed before 1900, are loaded with smudges, stains, watermarks and crooked type, all of which give O.C.R. programs fits. To fix the errors, Dr. von Ahn uses a number of programs, which when applied in the proper sequence transform troubled passages into easy-to-read prose.
The first step is done in-house. Two different O.C.R. programs scan the photographic image. Both will make mistakes, but not necessarily the same mistakes.
ReCaptcha flags as “suspicious” any word that is deciphered differently by the two programs or that does not appear in an English dictionary. The dictionary catches words that are misspelled the same way by both O.C.R. programs. Other programs examine the words on either side of the suspect word and make another guess based on that analysis.
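The flagging rule can be sketched roughly as follows. This is not reCaptcha's actual code: the function name, the tiny dictionary and the sample O.C.R. outputs are all made up for illustration, and the sketch assumes the two programs' outputs are already aligned word for word. The context analysis mentioned above is left out.

```python
def flag_suspicious(ocr_a, ocr_b, dictionary):
    """Return positions of words that need a human reader.

    A word is flagged if the two O.C.R. outputs disagree, or if they agree
    on something that is not in the dictionary.
    Assumption for this sketch: ocr_a and ocr_b are aligned word lists.
    """
    flagged = []
    for i, (a, b) in enumerate(zip(ocr_a, ocr_b)):
        if a != b or a.lower() not in dictionary:
            flagged.append(i)
    return flagged


# Hypothetical data: word 2 is read differently by the two programs,
# word 3 is misread the same way by both.
dictionary = {"the", "old", "musty", "text"}
ocr_a = ["The", "old", "rnusty", "tcxt"]
ocr_b = ["The", "old", "musty", "tcxt"]

print(flag_suspicious(ocr_a, ocr_b, dictionary))  # [2, 3]
```

The second flagged word shows why the dictionary check matters: comparing the two outputs alone would have let "tcxt" through.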
Then each suspicious word is turned into a Captcha. It is crucial to understand that the Captcha is a distorted version of the word as printed in the original photographic image. It is not made from the O.C.R. program’s guess, which is often unintelligible. The unknown word is then paired with a second Captcha word whose correct transcription is already known. This is the “control.”
Several Web users seeking entry to secure sites are then given both words and asked to decipher them separately.
A correct answer for the control word proves that the user is a human and not a machine. Answers for the unknown word are compared with the O.C.R. guesses and the context analysis. If the system is satisfied that the answer is correct, the word is considered solved.
Dr. von Ahn acknowledged that some words cannot be transcribed, usually because the original text is torn or damaged in some other way. If enough users fail to identify an unknown, the word is deemed to be indecipherable and is marked as such.
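One way to picture how answers for an unknown word might be pooled is the Python sketch below. It is a simplification under stated assumptions, not the system's real logic: the thresholds `min_agree` and `max_attempts` are invented for illustration, and the comparison with the O.C.R. guesses and context analysis is omitted.

```python
from collections import Counter

INDECIPHERABLE = "<indecipherable>"  # hypothetical marker for unreadable words


def resolve_unknown(responses, control_answer, min_agree=3, max_attempts=6):
    """Decide the fate of one unknown word.

    responses: list of (typed_control, typed_unknown) pairs, one per user.
    Returns the agreed word, INDECIPHERABLE, or None if more answers are needed.
    The two thresholds are assumptions made for this sketch.
    """
    # Only users who typed the control word correctly are trusted as human.
    trusted = [unknown for control, unknown in responses
               if control == control_answer]
    if trusted:
        word, count = Counter(trusted).most_common(1)[0]
        if count >= min_agree:
            return word
    if len(trusted) >= max_attempts:
        return INDECIPHERABLE
    return None


# Hypothetical answers: one user fails the control word and is ignored.
answers = [("upon", "morning"), ("upon", "morning"),
           ("xxxx", "spam"), ("upon", "morning")]
print(resolve_unknown(answers, "upon"))  # morning
```

If six trusted users each type something different, the same function returns the indecipherable marker instead of a word.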
ReCaptcha also fails badly on cursive, Dr. von Ahn said, adding that “nobody reads handwriting anymore.” And reCaptcha so far handles only English words, even though many reCaptcha Web sites have overseas clients whose users are not necessarily English speakers.
With all these constraints, reCaptcha nevertheless achieves an accuracy rate above 99 percent, which compares favorably with professional human transcribers. And Dr. von Ahn is convinced that performance will improve with experience, of which there will be no shortage.
“We’ll be going for a long time,” he said. “There’s a lot of printed material out there.”