



I tried applying the keypoint-matching portion of the SIFT paper to an input image, but I did not achieve very good results. I matched two input images against my database of several hundred different soda-can images and typically got results like those shown in the first picture. This suggests that when a cluttered background is present, matching individual keypoints does not produce reliable results. The second image shows all of the keypoints found in the input image, roughly 400 in total. Keypoints do appear to be generated on each soda can in the input image, but they are not being selected as the closest matches in the scene.
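For concreteness, here is a minimal sketch of the kind of per-keypoint matching being tested, using OpenCV's SIFT implementation together with Lowe's distance-ratio test. The file paths and the 0.8 ratio threshold are placeholders, not my actual pipeline:

```python
import cv2

# Paths are placeholders for one database image and one cluttered input frame.
db_img = cv2.imread("db/soda_can_01.png", cv2.IMREAD_GRAYSCALE)
scene_img = cv2.imread("input_frame.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp_db, desc_db = sift.detectAndCompute(db_img, None)
kp_scene, desc_scene = sift.detectAndCompute(scene_img, None)

# For each database descriptor, find its two nearest neighbors in the scene
# and keep the match only if it passes Lowe's distance-ratio test.
matcher = cv2.BFMatcher(cv2.NORM_L2)
good = [m for m, n in matcher.knnMatch(desc_db, desc_scene, k=2)
        if m.distance < 0.8 * n.distance]
print(f"{len(good)} matches survived the ratio test")
```

With a cluttered background, it is exactly this independent nearest-neighbor step that seems to break down: background descriptors win the ratio test against descriptors on the cans.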
To narrow down the problem, I tried matching a database image against a frame containing exactly the same image of the soda can. While this is obviously not a valid way to test a computer-vision algorithm's classification ability, it gave some indication of why the matching results were so poor. Even though the identical image was used, not all of the points were matched. This suggests that analyzing the constellation of points is essential when dealing with a cluttered background; otherwise, the background can contain so much variety that keypoints arising from it match better than keypoints on the correct object. In his SIFT paper, Lowe describes clustering consistent matches with a generalized Hough Transform. Although Lowe achieved good results using keypoint matching alone, he used high-resolution images in which the keypoints on the objects of interest were numerous and apparently more distinguishable from the background, as in the book-matching example at this link: http://www.cs.ubc.ca/~lowe/keypoints/
My next step will probably be to implement the generalized Hough Transform that Lowe describes, followed by pose estimation. One concern is that there may not be enough keypoints that can be matched against a database of images. In an earlier blog posting, I showed results from matching high-resolution images against low-resolution images that were scaled down and oriented differently. Those results showed that there were sometimes no keypoint matches at all between soda cans at different scales. If I have to add many different soda-can scales and orientations to my database of images, the matching algorithm might become prohibitively inefficient and the accuracy might drop even lower.
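As a rough sketch of that clustering step (my own simplified reading of the paper, not Lowe's exact implementation; he also votes into the two closest bins per dimension, and the model size comes from the training image): each match predicts a pose for the model, and matches are binned coarsely in pose space so that consistent constellations reinforce each other.

```python
import math
from collections import defaultdict

def hough_cluster(matches, kp_db, kp_scene, model_size=64.0):
    """Bin matches by the model pose each one predicts (x, y, orientation,
    scale), roughly as in Lowe's Hough-clustering step: bin widths of 30
    degrees for orientation, a factor of 2 for scale, and 0.25 of the
    model size for location. `model_size` is a hypothetical model-image
    extent in pixels; `matches` are cv2.DMatch objects."""
    bins = defaultdict(list)
    for m in matches:
        kd, ks = kp_db[m.queryIdx], kp_scene[m.trainIdx]
        d_ori = (ks.angle - kd.angle) % 360.0
        d_scale = ks.size / kd.size
        # Where this match says the model origin lands in the scene.
        t = math.radians(d_ori)
        ox = ks.pt[0] - d_scale * (math.cos(t) * kd.pt[0] - math.sin(t) * kd.pt[1])
        oy = ks.pt[1] - d_scale * (math.sin(t) * kd.pt[0] + math.cos(t) * kd.pt[1])
        key = (math.floor(ox / (0.25 * model_size)),
               math.floor(oy / (0.25 * model_size)),
               math.floor(d_ori / 30.0),
               math.floor(math.log2(d_scale)))
        bins[key].append(m)
    # Clusters of 3+ consistent matches go on to pose verification.
    return [ms for ms in bins.values() if len(ms) >= 3]
```

Each surviving cluster would then get a least-squares affine pose fit with outlier removal, as in the paper.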
I might also try using PCA-SIFT to extract keypoints. This algorithm replaces the smoothed orientation histogram that Lowe uses in his descriptor-creation stage with PCA (principal component analysis) applied to the gradient patch around each keypoint. PCA-SIFT keypoints are reportedly better at handling noisy images but more prone to localization error. I was also thinking about using the features described in the Video Google paper, but there is no indication of how efficient the computation of the "visual words" is; the paper apparently assumes the visual words have already been computed for a video before objects are classified.
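To illustrate the PCA-SIFT idea (this is my own numpy sketch of the technique from Ke and Sukthankar's paper, not their code): instead of building the 128-dimensional smoothed histogram, you take the normalized gradient patch around each keypoint (2 x 39 x 39 = 3042 values in the paper) and project it onto a low-dimensional basis learned offline with PCA.

```python
import numpy as np

def fit_pca_basis(gradient_patches, n_components=20):
    """Learn the projection offline from a stack of flattened gradient
    patches, shape (N, 3042) in the paper's setup (39x39 patch, x and y
    gradients). Returns the mean vector and the top principal directions."""
    mean = gradient_patches.mean(axis=0)
    _, _, vt = np.linalg.svd(gradient_patches - mean, full_matrices=False)
    return mean, vt[:n_components]

def pca_sift_descriptor(patch_vec, mean, basis):
    """Replace SIFT's 128-D histogram with a ~20-D PCA projection of
    one keypoint's flattened gradient patch."""
    return basis @ (patch_vec - mean)
```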
Perhaps a long-term solution to object detection on a mobile robot is simply to detect the presence of an object, without regard to what type of object it is. Objects of a certain size, color, dimension, etc. could then be inspected more closely, at which point some SIFT variation could be applied; a sketch of such a first stage follows below. Perhaps eventually an entire scene could be reconstructed as a 3D model, and while that reconstruction is taking place, some easy-to-recognize objects could be classified before it completes.
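As a sketch of what that cheap "presence detection" stage could look like (the HSV thresholds and minimum blob area here are made-up placeholders that would need tuning for real cans and lighting):

```python
import cv2
import numpy as np

def find_candidate_regions(frame_bgr, lo_hsv, hi_hsv, min_area=500):
    """Cheap attention step: segment one color range in HSV and return
    bounding boxes of blobs large enough to plausibly be a soda can.
    SIFT (or a variant) would then run only inside each box."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, np.array(lo_hsv), np.array(hi_hsv))
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.boundingRect(c) for c in contours
            if cv2.contourArea(c) >= min_area]

# e.g., red-ish cans: find_candidate_regions(frame, (0, 120, 70), (10, 255, 255))
```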
For now, I will implement the generalized Hough Transform and the other stages mentioned in Lowe's SIFT paper and see under what conditions soda cans can be recognized reliably. I will then be in a better position to determine whether enough keypoints are being generated, and to figure out where AdaBoost fits into the detection/recognition process.