Tuesday, July 22, 2008

Label Reading with a Purpose

The Calit2 Life blog ran a great story from Jacobs School computer science professor Serge Belongie. The post is republished below. This is an update on a story that I wrote about last fall, when Belongie and his collaborators presented their ideas at a conference.

Soylent Grid Is People!

One of the big challenges in solving large scale object recognition problems is the need to obtain vast amounts of labeled training data. Such data is essential for training computer vision systems based on statistical pattern recognition techniques, for which a single example image of an object is unfortunately not enough.

For my research group, this has been especially evident in our work on the Calit2 GroZi project, which has the goal of developing assistive technology for the visually impaired. This includes tasks such as recognizing products on grocery shelves and reading text in natural scenes. (Check out this YouTube video for a bit of background on the project.)
In the past, this type of labor-intensive data labeling task would fall to hapless grad students or undergrad volunteers. (As an example, last winter my TIES group and CSE graduate student Shiaokai Wang manually labeled all the text on hundreds of product packages, all for the meager reward of pizza and soda.)
Recently, however, a movement has emerged that harnesses Human Computation to solve such labeling tasks using a highly distributed network of human volunteers. As an example, CMU's reCAPTCHA system applies this principle to the task of transcribing old scanned documents, in which the image quality is low enough to throw off conventional Optical Character Recognition (OCR) software.
Think of it like this. Every time you solve a CAPTCHA, i.e., one of those distorted words you have to type in at websites like MySpace and Hotmail to prove that you're not a spambot, you're using your powerful human intelligence to solve a small puzzle. Systems like reCAPTCHA, the Mechanical Turk, and the Soylent Grid (currently under development by Calit2 affiliate Stephan Steinbach, CSE graduate student and CISA3 project member Vincent Rabaud, visiting scholar Valentin Leonardi, and TIES summer scholar and ECE undergraduate Hourieh Fakourfar) seek to redirect this human problem-solving ability toward useful tasks.
Hourieh's summer project aims to adapt our fledgling Soylent Grid prototype to the above-mentioned text annotation task. A critical requirement for such a system to work is a steady stream of web visitors looking for content.

Some day, when the Soylent Grid is a household name, we'll have strategic partnerships set up with big-name websites that serve up thousands of CAPTCHAs per hour. Until then, we've got our work cut out for us to find some traffic to get our experiment started. As a humble starting point, we're going to outfit the PDF links on my group's publications page so that people who click on a link get served a labeling task before they can download the PDF. From there, we plan to move on to bigger and better websites with increased levels of traffic.
Now you may ask, how do we prevent visitors from inputting nonsense instead of providing useful annotation? As with reCAPTCHA, the solution is to use a pair of images, one with known (ground truth) annotation, the other unknown. In this way, the visitor's response on the known example can be used to validate the response on the other example. Moreover, the responses of multiple visitors on the same image can be pooled to form a confidence level, and when this level is high enough, an image can be moved from the "unknown" stack to the "known" stack.
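To make the idea concrete, here is a minimal Python sketch of the paired-image check and the pooling step described above. It is only an illustration, not the Soylent Grid code: the class name LabelPool, the text normalization, and the thresholds min_votes and min_agreement are all assumptions made for the example, and confidence is taken to be simply the share of accepted votes that agree with the most common answer.

```python
from collections import Counter, defaultdict


class LabelPool:
    """Hypothetical sketch: validate answers against a known image, pool the rest."""

    def __init__(self, known, min_votes=3, min_agreement=0.8):
        self.known = dict(known)            # image id -> trusted (ground truth) label
        self.votes = defaultdict(Counter)   # image id -> tally of accepted answers
        self.min_votes = min_votes          # assumed: votes needed before promotion
        self.min_agreement = min_agreement  # assumed: share that must agree

    @staticmethod
    def _norm(text):
        # Ignore case and extra whitespace when comparing typed answers.
        return " ".join(text.lower().split())

    def submit(self, known_id, known_answer, unknown_id, unknown_answer):
        """Record the unknown answer only if the known answer checks out."""
        if self._norm(known_answer) != self._norm(self.known[known_id]):
            return False  # failed the ground-truth check; discard both answers
        if unknown_id not in self.known:
            self.votes[unknown_id][self._norm(unknown_answer)] += 1
            self._maybe_promote(unknown_id)
        return True

    def confidence(self, image_id):
        """Return (most common answer, fraction of votes agreeing with it)."""
        counts = self.votes.get(image_id)
        if not counts:
            return (None, 0.0)
        label, top = counts.most_common(1)[0]
        return (label, top / sum(counts.values()))

    def _maybe_promote(self, image_id):
        # Move the image from the "unknown" stack to the "known" stack
        # once enough visitors agree on its label.
        label, conf = self.confidence(image_id)
        total = sum(self.votes[image_id].values())
        if total >= self.min_votes and conf >= self.min_agreement:
            self.known[image_id] = label
            del self.votes[image_id]
```

A visitor who mistypes the known image contributes nothing, while each visitor who gets it right adds one vote for the unknown image. With the assumed defaults, three matching votes are enough to promote an image, after which it can itself serve as the known half of a pair.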
Naturally, many questions remain. How do we make these labeling tasks sufficiently atomic and easy to complete so that the web visitor doesn't get frustrated? How much ground truth labeling is needed in a given image database to "prime the pump"? How do we deal with ambiguity in the labeling task or in the user input? Some initial thoughts on these and other questions are put forward in Stephan and Vincent's position paper from ICV'07, but there's nothing like a messy real-world experiment to get real-world answers to these questions!