Fending off SPAM with reCAPTCHA

August 6, 2010

reCAPTCHA is a free CAPTCHA service that helps to digitize books, newspapers, magazines and other old publications that exist only in print.

The term CAPTCHA stands for “Completely Automated Public Turing test to tell Computers and Humans Apart”. It is a type of challenge-response test used in computing to ensure that a response is generated by a human being. The software displays a challenge that computers should be unable to solve. Or at least, that’s the goal. The most common type of CAPTCHA requires the user to type the letters or digits shown in a distorted image on the screen.

However, computers are getting smarter by the day, and newly developed software can now guess many of these images correctly. To counteract these advances, CAPTCHAs are being made even more difficult to read. The end result is that humans have a hard time guessing the correct answer, while spambot software simply keeps evolving, defeating every effort to outsmart it by the very humans who designed it in the first place.

The line between what is readable to humans and unreadable to computers was about to be crossed in the wrong direction when a new CAPTCHA approach, named reCAPTCHA, emerged from the work of Guatemalan computer scientist Luis von Ahn. An early CAPTCHA developer, von Ahn realized that “he had unwittingly created a system that was frittering away, in ten-second increments, millions of hours of a most precious resource: human brain cycles.”

About 200 million CAPTCHAs are solved by humans around the world every day. In each case, roughly ten seconds of human time are being spent. Individually, that’s not a lot of time, but in aggregate these little puzzles consume more than 150,000 hours of work each day. What if we could make positive use of this human effort? reCAPTCHA does exactly that by channeling the effort spent solving CAPTCHAs online into “reading” books.
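As a quick sanity check on those numbers, the short Python sketch below multiplies the two figures quoted above. The function name is made up for illustration. Taken at face value, 200 million CAPTCHAs at ten seconds each add up to well over half a million hours a day, so “more than 150,000 hours” reads as a conservative lower bound.

```python
# Figures quoted in the paragraph above.
CAPTCHAS_PER_DAY = 200_000_000
SECONDS_PER_CAPTCHA = 10
SECONDS_PER_HOUR = 3600


def daily_hours(captchas_per_day: int, seconds_each: float) -> float:
    """Total human time spent per day, in hours (illustrative helper)."""
    return captchas_per_day * seconds_each / SECONDS_PER_HOUR


if __name__ == "__main__":
    hours = daily_hours(CAPTCHAS_PER_DAY, SECONDS_PER_CAPTCHA)
    print(f"{hours:,.0f} hours of human effort per day")
```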

To archive human knowledge and to make information more accessible to the world, multiple projects are currently digitizing physical books that were written before the computer age. The book pages are photographically scanned and then transformed into text using “Optical Character Recognition” (OCR). The transformation into text is useful because scanning a book only produces large images, which are difficult to store, expensive to download, impossible to search and inaccessible to visually impaired users. But OCR is far from perfect, and if the source text is not in good condition its accuracy drops significantly.

reCAPTCHA improves the process of digitizing books by sending words that OCR cannot read correctly to the Web for humans to decipher. Each of those words is delivered as an image and used as a CAPTCHA.

Every new word that cannot be read correctly by OCR is given to a user together with another word for which the answer is already known. The user is then asked to read both words. If they solve the word whose answer is known, the system assumes their answer for the new word is correct as well. The same image is given to a number of humans so that the correct answer can be determined with higher accuracy.
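The scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual reCAPTCHA code: the class name, the function name, the case-insensitive comparison and the `votes_needed` threshold are all invented here, since the article does not say how many matching answers the real system requires.

```python
from collections import Counter
from typing import Optional


def normalize(text: str) -> str:
    """Compare answers without regard to case or stray spaces (assumption)."""
    return text.strip().lower()


class UnknownWord:
    """A word OCR could not read, collecting guesses from human solvers."""

    def __init__(self, word_id: str, votes_needed: int = 3) -> None:
        self.word_id = word_id
        self.votes_needed = votes_needed  # hypothetical agreement threshold
        self.votes: Counter = Counter()

    def add_vote(self, guess: str) -> None:
        self.votes[normalize(guess)] += 1

    def accepted_reading(self) -> Optional[str]:
        """Return the most common guess once enough solvers agree on it."""
        if not self.votes:
            return None
        answer, count = self.votes.most_common(1)[0]
        return answer if count >= self.votes_needed else None


def check_challenge(control_answer: str, control_guess: str,
                    unknown: UnknownWord, unknown_guess: str) -> bool:
    """Pass the user only if the known (control) word was typed correctly.

    A correct control word is taken as evidence that the user is human,
    so the guess for the unknown word is recorded as one vote.
    """
    if normalize(control_guess) != normalize(control_answer):
        return False
    unknown.add_vote(unknown_guess)
    return True


if __name__ == "__main__":
    word = UnknownWord("scan-0001", votes_needed=2)
    check_challenge("morning", "morning", word, "overlooks")
    check_challenge("morning", "evening", word, "xyz")  # fails, vote ignored
    check_challenge("morning", "Morning", word, "Overlooks")
    print(word.accepted_reading())
```

Guesses from users who fail the known word are discarded, which is what keeps bots from polluting the transcription of the unknown one.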

reCAPTCHA uses two layers of security when generating images: it starts from words that computers already failed to read, and then distorts those images even further.

As of September 2009, 30 million CAPTCHAs were reportedly being solved through the system every day. Among its subscribers are such popular sites as Facebook, TicketMaster, Twitter, 4chan, StumbleUpon and Craigslist.

In 2009 Google acquired reCAPTCHA, and it now uses this technology in a variety of large-scale text scanning projects such as Google Books and Google News Archive Search. If you own a website, you can also contribute to a valuable cause while effectively fighting SPAM by implementing reCAPTCHA in your contact forms and using it to protect exposed e-mail addresses.

More information here:

  1. Sign up for a reCAPTCHA API key.
  2. Read the reCAPTCHA developer’s guide.
  3. Join the reCAPTCHA user’s group.