Friday, January 31, 2014

reCAPTCHA-ing the Book


I'm interested in some book digitization projects that are coming out of Carnegie Mellon. Researchers at Carnegie Mellon developed CAPTCHA and reCAPTCHA (as seen below, in case you didn't know what it was called). reCAPTCHA is now what we all use to validate that we are humans, not computer programs. Usually these come up when we are buying something online or registering somewhere. 

Unlike its predecessor, reCAPTCHA puts your typing to work: when you type in the two words, you are also helping to digitize books. (Let that sink in.) 

How does it work? (This synopsis will do no justice to the reCAPTCHA program, but there are links below for more information.) 

First, the book is scanned. Second, software tries to decipher all the words in the digital image. Sometimes the digital image comes out unclear - words bleed into each other or the page is smudged. Unfortunately, optical character recognition (OCR) software has a difficult time reading roughly 30% of the words in books over fifty years old (Luis von Ahn, Carnegie Mellon University). So the system takes the words the software cannot read and puts them in the reCAPTCHA box. Essentially, they are getting people to read the parts of the book the software can't, for free, little by little. 

One of the words in the reCAPTCHA box is known to the system and the other word is one from a book undergoing digitization. Take the above example. Perhaps the system knows the word Canada, which, if you type it correctly, will let you through to the next page. The system presumes that if you typed Canada correctly, you also typed blame correctly. Once a certain number of people have agreed the word is spelled "b-l-a-m-e", the word can be said to be digitized accurately. 
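
To make that a bit more concrete, here is a rough sketch in Python of the idea as I understand it. This is not reCAPTCHA's actual code. The class name, the method names, and the quorum of three are all my own assumptions, there only to illustrate the "known word plus agreement" logic described above.

```python
from collections import Counter
from typing import Optional


def normalize(word: str) -> str:
    """Compare answers case-insensitively and ignore surrounding spaces."""
    return word.strip().lower()


class UnknownWord:
    """Collects human guesses for one word the OCR software could not read.

    Hypothetical illustration only; not the real reCAPTCHA implementation.
    """

    def __init__(self, quorum: int = 3) -> None:
        # quorum is an assumed threshold, not reCAPTCHA's real setting.
        self.quorum = quorum
        self.votes: Counter = Counter()

    def submit(self, control_word: str, typed_control: str, typed_unknown: str) -> bool:
        """Return True if the user passes; record their guess only if they do."""
        if normalize(typed_control) != normalize(control_word):
            return False  # failed the known word, so the guess is not trusted
        self.votes[normalize(typed_unknown)] += 1
        return True

    def accepted_spelling(self) -> Optional[str]:
        """Return the spelling once enough people agree on it, else None."""
        if not self.votes:
            return None
        spelling, count = self.votes.most_common(1)[0]
        return spelling if count >= self.quorum else None


word = UnknownWord(quorum=3)
word.submit("Canada", "Canada", "blame")
word.submit("Canada", "Kanada", "flame")  # wrong control word: guess ignored
word.submit("Canada", "canada", "blame")
word.submit("Canada", "Canada", "blame")
print(word.accepted_spelling())  # blame
```

The guess from the person who mistyped Canada never gets counted, which is the whole point of pairing a known word with an unknown one.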

Websites that are using reCAPTCHA include Google, Twitter, Ticketmaster, and 350,000 other sites. This means that roughly 100,000,000 words are being digitized daily (Luis von Ahn, Carnegie Mellon University). That equals something around 2,500,000 books a year - one word at a time (Luis von Ahn, Carnegie Mellon University). 
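
For anyone who wants to see how those two figures relate to each other, here is a quick back-of-the-envelope calculation in Python. The words-per-book number it prints is something I derived from the two quoted figures, not a number from the source, and it assumes a 365-day year.

```python
# Figures quoted above (Luis von Ahn, Carnegie Mellon University).
WORDS_PER_DAY = 100_000_000
BOOKS_PER_YEAR = 2_500_000

# Assumption: words come in at the same rate every day of the year.
DAYS_PER_YEAR = 365

words_per_year = WORDS_PER_DAY * DAYS_PER_YEAR
implied_words_per_book = words_per_year / BOOKS_PER_YEAR

print(words_per_year)          # 36500000000
print(implied_words_per_book)  # 14600.0
```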

I find it fascinating that researchers have found ways - or are at least thinking about ways - to incorporate things like internet security software into the digitization of print culture. To take it a step further, they found a way to accurately decipher words that OCR can't read on its own. That said, I have my reservations about calling it completely accurate. But it's a step in the right direction. 

I've included some information about CAPTCHA and reCAPTCHA in the links below, should you be interested in a better description. The second link also has some critiques of the program's flaws, if you're interested. 
http://www.captcha.net/
https://www.cylab.cmu.edu/partners/success-stories/recaptcha.html



NB: The book wheel of the 21st century is something completely different. 

