Tags: tensorflow, image-processing, keras, pytorch, cbir

Reverse image search (for image duplicates) on a local computer


I have a bunch of poor-quality photos that I extracted from a PDF. Somebody I know has the good-quality photos somewhere on her computer (Mac), but my understanding is that it will be difficult to find them.

I would like to

  • loop through each poor quality photo
  • perform a reverse image search, using each poor-quality photo as the query image and this person's computer as the database to search for the higher-quality images
  • and create a copy of each high quality image in one destination folder.

Example pseudocode

for each image in poorQualityImages:
    search ./macComputer for a higherQualityImage of image
    copy higherQualityImage to ./higherQualityImages

I need to perform this action once. I am looking for a tool, GitHub repo, or library that can perform this task, rather than a deep understanding of content-based image retrieval.


There's a post on Reddit where someone was trying to do something similar.

imgdupes is a program that almost achieves this, but I do not want to delete the duplicates; I want to copy the highest-quality duplicate to a destination folder.


Update

I emailed my previous image-processing professor and he sent me this:

Off the top of my head, nothing out of the box.

No guaranteed solution here, but you can narrow the search space. You'd need a little program that outputs the MSE or SSIM similarity index between two images, and then another program or shell script that scans the hard drive, computes the MSE between each image on the hard drive and each query image, and then checks the images with the top X percent similarity score.

Something like that. Still maybe not guaranteed to find everything you want. And if the low-quality images are of different pixel dimensions than the high-quality images, you'd have to do some image scaling to get the similarity index. If the poor-quality images have different aspect ratios, that's even worse.

So I think it’s not hard but not trivial either. The degree of difficulty is partly dependent on the nature of the corruption in the low quality images.
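A minimal sketch of the approach he describes, assuming Pillow, NumPy and scikit-image are installed. The grayscale conversion, the fixed 256x256 resize, the query filename and the extension list below are my own assumptions to make the comparison well-defined, not part of his suggestion.

from pathlib import Path

import numpy as np
from PIL import Image
from skimage.metrics import structural_similarity

def compare(query_path, candidate_path, size=(256, 256)):
    """Resize both images to a common size and return (MSE, SSIM)."""
    a = np.asarray(Image.open(query_path).convert("L").resize(size), dtype=float)
    b = np.asarray(Image.open(candidate_path).convert("L").resize(size), dtype=float)
    mse = float(np.mean((a - b) ** 2))
    ssim = structural_similarity(a, b, data_range=255)
    return mse, ssim

# Rank every image under the search directory against one query image,
# then inspect the top few percent by SSIM, as suggested above.
scores = []
for p in Path("./macComputer").rglob("*"):
    if p.suffix.lower() in {".jpg", ".jpeg", ".png", ".bmp", ".tif"}:
        scores.append((compare("poor_quality.jpg", p)[1], p))
scores.sort(reverse=True)  # highest SSIM first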


Update

GitHub project I wrote which achieves what I want


Solution

  • What you are looking for is called image hashing. In this answer you will find a basic explanation of the concept, as well as a go-to GitHub repo for a plug-and-play application.

    Basic concept of Hashing

    From the repo page: "We have developed a new image hash based on the Marr wavelet that computes a perceptual hash based on edge information with particular emphasis on corners. It has been shown that the human visual system makes special use of certain retinal cells to distinguish corner-like stimuli. It is the belief that this corner information can be used to distinguish digital images that motivates this approach. Basically, the edge information attained from the wavelet is compressed into a fixed length hash of 72 bytes. Binary quantization allows for relatively fast hamming distance computation between hashes. The following scatter plot shows the results on our standard corpus of images. The first plot shows the distances between each image and its attacked counterpart (e.g. the intra distances). The second plot shows the inter distances between altogether different images. While the hash is not designed to handle rotated images, notice how slight rotations still generally fall within a threshold range and thus can usually be matched as identical. However, the real advantage of this hash is for use with our mvp tree indexing structure. Since it is more descriptive than the dct hash (being 72 bytes in length vs. 8 bytes for the dct hash), there are much fewer false matches retrieved for image queries. "
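    As a quick illustration of the hamming distance mentioned above (this is not the repo's code): it is simply the number of bit positions in which two binary hashes differ, which XOR plus a bit count gives directly.

    >>> a = 0b10110110          # two toy 8-bit "hashes"
    >>> b = 0b10011100
    >>> bin(a ^ b).count("1")   # hamming distance: number of differing bits
    3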

    Another blogpost for an in-depth read, with an application example.

    Available Code and Usage

    A GitHub repo can be found here; there are obviously more to be found. After importing the package, you can use it to generate and compare hashes:

    >>> from PIL import Image
    >>> import imagehash
    >>> hash = imagehash.average_hash(Image.open('test.png'))
    >>> print(hash)
    d879f8f89b1bbf
    >>> otherhash = imagehash.average_hash(Image.open('other.bmp'))
    >>> print(otherhash)
    ffff3720200ffff
    >>> print(hash == otherhash)
    False
    >>> print(hash - otherhash)
    36
    

    The demo script find_similar_images, also in the mentioned GitHub repo, illustrates how to find similar images in a directory.
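    For the copy-instead-of-delete workflow in the question, here is a rough sketch along the same lines; the folder paths, extension list, and distance threshold below are assumptions to tune, and average_hash could be swapped for phash or dhash:

    import shutil
    from pathlib import Path

    from PIL import Image
    import imagehash

    QUERY_DIR = Path("./poorQualityImages")   # hypothetical folders from the pseudocode
    SEARCH_DIR = Path("./macComputer")
    DEST_DIR = Path("./higherQualityImages")
    DEST_DIR.mkdir(exist_ok=True)
    EXTS = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".bmp"}

    # Hash every candidate image on the other machine once.
    candidates = {}
    for path in SEARCH_DIR.rglob("*"):
        if path.suffix.lower() in EXTS:
            try:
                candidates[path] = imagehash.average_hash(Image.open(path))
            except OSError:
                pass  # skip unreadable files

    # For each poor-quality query, copy the candidate with the smallest hash distance.
    for query in QUERY_DIR.iterdir():
        if query.suffix.lower() not in EXTS:
            continue
        qhash = imagehash.average_hash(Image.open(query))
        best = min(candidates.items(), key=lambda item: qhash - item[1], default=None)
        if best is not None and qhash - best[1] <= 10:  # assumed distance threshold
            shutil.copy2(best[0], DEST_DIR / best[0].name)

    Hashing every candidate once and reusing that table keeps the search to a single pass over the other computer's drive; the threshold decides how different a "match" is allowed to be.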