Rapid Image Search with CNN Fingerprints

Overview

An important task in the field of image analysis is finding similar images within a massive database. Image similarity search can be used, for example, to count all of the solar panels in the United States from satellite imagery and do a total energy output analysis, or to search through the Earth's oceans to find and follow whales as they migrate. If you're dealing with image databases on the terabyte scale, it is not feasible to compare one image directly with every other image in the database. Instead, we can compress all of our images to information-dense fingerprints, and do the comparison among these fingerprints.

In this project, a ~70 GB dataset of aerial images is fed through a pretrained, state-of-the-art convolutional neural network (CNN), and the activation of an intermediate layer of this network is interpreted as a low-dimensional, information-rich fingerprint for the original image. Similarity between two fingerprints (1-dimensional vectors) is then calculated as the Euclidean distance. The fingerprints offer a 175x data size reduction from the original images, so that the k most similar images to a query can be found in just over 3 seconds on a single CPU.


Methods

Data Collection and Preprocessing

The image database used for this project was a collection of high-resolution drone images from the DroneDeploy GitHub repository. The database contains 60 .TIF images of neighborhoods, natural scenes, and cities with a resolution of 10 cm per pixel. When pixel values in the .TIFs are converted to 32-bit floats, the dataset has a total size of 70 GB.

These high-resolution drone images are then cut into smaller tiles, each of size 256 x 256 pixels. The tiling is achieved with a 128-pixel sliding window, so that image motifs are likely to be well centered in at least one tile. Tiles with >50% null pixel values are removed from the dataset.
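The sliding-window tiling can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `tile_image` is hypothetical, and the image is assumed to already be a NumPy array with null pixels encoded as NaN.

```python
import numpy as np

def tile_image(img, tile=256, stride=128, max_null_frac=0.5):
    """Cut an H x W x C image into overlapping tiles with a sliding window.

    Tiles whose fraction of null (NaN) pixels exceeds max_null_frac are
    discarded, mirroring the >50% null-pixel filter described above.
    """
    tiles = []
    h, w = img.shape[:2]
    for y in range(0, h - tile + 1, stride):
        for x in range(0, w - tile + 1, stride):
            t = img[y:y + tile, x:x + tile]
            if np.isnan(t).mean() <= max_null_frac:
                tiles.append(((y, x), t))  # keep the tile's origin as its id
    return tiles

# Toy example: a 512 x 512 single-band image yields a 3 x 3 grid of tiles
img = np.zeros((512, 512, 1), dtype=np.float32)
print(len(tile_image(img)))  # 9
```

Because the 128-pixel stride is half the tile width, every point in the interior of the image is covered by four overlapping tiles, which is what makes it likely that any motif lands near the center of at least one tile.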

A small portion of an example .TIF image is shown to the right. The original image is 680 MB in size, and was processed into 3,836 tiles.

Compressing Images to CNN-Generated Fingerprints

Tiles are compressed to low-dimensional fingerprints by being fed through a pretrained CNN. For this project, I used ResNet-18, which has the architecture shown to the left. ResNet-18 was trained to classify RGB images into 1000 different classes, and the intermediate features it learns to complete this task are useful for many computer vision tasks, including identifying similar motifs in geosatellite imagery.

Because we want ResNet to generalize to our satellite images (not just the 1000 classes it is trained to recognize), we remove the last layer of the network, and take the activations of the average_pool layer as each tile's fingerprint. This effectively reduces each tile, initially 256 x 256 x 3 ≈ 200,000 floats, down to only 512 floats, giving a theoretical max data compression of 384x.

Once fingerprints are generated for all tiles in the dataset, the user selects a query image, and the k most similar images to the query can be returned as the k fingerprints with the smallest Euclidean distance from the query fingerprint.
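A minimal version of this nearest-neighbor search might look as follows. The function name `k_most_similar` is hypothetical; the `fingerprints` dict uses the {tile_unique_identifier : fingerprint} layout described in the Results section below.

```python
import numpy as np

def k_most_similar(query, fingerprints, k=10):
    """Return the ids of the k fingerprints closest (Euclidean) to the query.

    `fingerprints` is a dict {tile_id: fingerprint vector}.
    """
    ids = list(fingerprints)
    mat = np.stack([fingerprints[i] for i in ids])  # shape (N, 512)
    dists = np.linalg.norm(mat - query, axis=1)     # one vectorized O(N) pass
    order = np.argsort(dists)[:k]
    return [ids[i] for i in order]

# Toy example with 3-dimensional "fingerprints"
fps = {"a": np.array([0.0, 0.0, 0.0]),
       "b": np.array([1.0, 0.0, 0.0]),
       "c": np.array([5.0, 5.0, 5.0])}
print(k_most_similar(np.array([0.1, 0.0, 0.0]), fps, k=2))  # ['a', 'b']
```

Stacking the fingerprints into one matrix lets NumPy compute all N distances in a single vectorized pass, which is what keeps the brute-force search fast despite its linear time complexity.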


Results

Here is an example of what the algorithm produces, with the query tile shown on the left and the 10 most similar images returned on the right:

In terms of data compression, the fingerprints for all image tiles are stored in a hash table of structure {tile_unique_identifier : fingerprint}, so that a fingerprint can be quickly related back to its initial tile. This hash table was implemented as a Python dictionary, and has a size of 400 MB, meaning that we achieve a data compression rate of 175x relative to the initial image dataset of size 70 GB. This compression is what makes image similarity search computationally cheap.
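A rough back-of-envelope check (my own arithmetic, not a figure from the project, and ignoring Python dict overhead) shows these numbers are consistent: at 512 float32 values per fingerprint, a 400 MB table holds on the order of 200,000 tiles, and dictionary and key overhead accounts for the gap between the 384x theoretical compression and the 175x achieved.

```python
# Each fingerprint is 512 float32 values = 2 KiB of raw vector data.
bytes_per_fingerprint = 512 * 4

# Approximate number of fingerprints a 400 MB table could hold if it
# stored only the raw vectors (real dicts add per-entry overhead).
n_tiles = 400_000_000 // bytes_per_fingerprint
print(n_tiles)  # 195312, i.e. on the order of 200,000 tiles
```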

Actually finding the k most similar tiles to a query tile requires a Euclidean distance calculation between the query fingerprint and all other fingerprints. This search has a time complexity that grows linearly with the number of tiles. In speed tests on a single CPU, the 10 most similar images to a query were computed in an average of 3.2 seconds (95% CI 2.8 - 3.6 s). This means that the entire 70 GB dataset can be searched on a single CPU in a reasonable amount of time.

Of course, all of these results only matter if the algorithm returns images that actually look visually similar to the query. This is a difficult thing to quantify, but the qualitative results on image similarity are certainly promising! Scroll back up to the top of this page to see 7 randomly chosen queries and the 10 most similar images for each. The queries were selected at random, not cherry-picked for good-looking results.

Code

The code for this project is available on GitHub, and there is also a written paper with much more detail that you can check out too! Find the links for both below.

Code on GitHub
Download Writeup