Tags: python, compare, subtraction

Check if there is exactly the same image as the input image


I want to know how I can find an image in massive data (there are a lot of images in a folder). Given an input image from another folder (not the data folder), I want to compare the input image with all of the massive data, and if it finds exactly the same image, show its name as output (the name of the matching image in the folder, not the input name), for example: dafs.jpg

using Python

I am thinking about comparing the exact RGB pixel values and subtracting the pixels of the input image from the pixels of each image in the folder,

but I don't know how to do that in Python.


Solution

  • Comparing RGB Pixel Values

You could use the Pillow module to get access to the pixel data of a particular image. Keep in mind that Pillow supports only a certain set of image formats.

Based on your description, let's make a few assumptions about what it means for 2 images to be identical. Both images must:

    • Have the same dimensions (height and width)
    • Have the same RGB pixel values (the RGB values of pixel [x, y] in the input image must be the same as the RGB values of pixel [x, y] in the output image)
• Have the same orientation (related to the previous assumption: an image is not considered identical to the same image rotated by 90 degrees)

Then, if we open 2 images with the Pillow module

    from PIL import Image
    
    original = Image.open("input.jpg")
    possible_duplicate = Image.open("output.jpg")
    

    the following code would be able to compare the 2 images to see if they were identical

def compare_images(input_image, output_image):
  # compare image dimensions (assumption 1)
  if input_image.size != output_image.size:
    return False

  # Image.size is (width, height)
  width, height = input_image.size

  # compare image pixels (assumptions 2 and 3);
  # getpixel takes an (x, y) coordinate
  for x in range(width):
    for y in range(height):
      input_pixel = input_image.getpixel((x, y))
      output_pixel = output_image.getpixel((x, y))
      if input_pixel != output_pixel:
        return False

  return True
    

    by calling

    compare_images(original, possible_duplicate)
    

    Using this function, we could go through a set of images

    from PIL import Image
    
def find_duplicate_image(input_image, output_images):
  # only open the input image once
  input_image = Image.open(input_image)

  for image in output_images:
    if compare_images(input_image, Image.open(image)):
      return image

  # no duplicate was found
  return None
    

    Putting it all together, we could simply call

    original = "input.jpg"
    possible_duplicates = ["output.jpg", "output2.jpg", ...]
    
    duplicate = find_duplicate_image(original, possible_duplicates)
    

    Note that the above implementation will only find the first duplicate, and return that. If no duplicate is found, None will be returned.

One thing to keep in mind is that performing a comparison on every pixel like this can be costly. I ran compare_images 100 times on a 600 by 600 pixel image, using the same image as both the input and the output, with the timeit module, and took the average of all those runs

import timeit

num_trials = 100
trials = timeit.repeat(
    repeat=num_trials,
    number=1,
    stmt="compare_images(Image.open('input.jpg'), Image.open('input.jpg'))",
    setup="from __main__ import compare_images; from PIL import Image"
)
avg = sum(trials) / num_trials

print("Average time taken per comparison was:", avg, "seconds")

# Average time taken per comparison was 1.3337286046380177 seconds
    

Note that this was done on an image that was only 600 by 600 pixels. If you did this with a "massive" set of possible duplicate images, where I will take "massive" to mean at least 1M images of similar dimensions, this could take roughly 15 days (1,000,000 * 1.33s / 60 seconds / 60 minutes / 24 hours ≈ 15.4 days) to go through and compare each output image to the input, which is not ideal.

    Also keep in mind that these metrics will vary based on the machine and operating system you are using. The numbers I provided are more for illustrative purposes.
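
    One way to speed up the comparison considerably, while keeping the same three assumptions, is to compare the raw pixel buffers in a single call instead of looping over getpixel in Python. A minimal sketch of that idea (my own addition, not part of the original answer), using Pillow's tobytes method:

    def compare_images_fast(input_image, output_image):
      # compare image dimensions (assumption 1)
      if input_image.size != output_image.size:
        return False

      # compare the raw pixel data in one call (assumptions 2 and 3);
      # tobytes() returns the pixel data as a bytes object, and checking
      # the mode guards against e.g. an RGB vs. RGBA mismatch
      return (input_image.mode == output_image.mode
              and input_image.tobytes() == output_image.tobytes())

    This is a drop-in replacement for compare_images in find_duplicate_image, and the equality check runs inside the bytes comparison rather than in a Python-level loop.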

    Alternative Implementation

While I haven't fully explored this implementation myself, one method you could try would be to precompute a hash of the pixel data of each image in your collection. If you stored these hashes in a database, with each hash linked to the original image or image name, then all you would have to do is compute the hash of the input image with the same hashing function and compare the hashes instead. This would save a lot of computation time and make for a much more efficient algorithm.

    This blog post describes one implementation for doing this.
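
    As a rough sketch of how that precomputation might look (my own illustration, not the linked post's implementation; sha256 from hashlib is an arbitrary choice of hash function, and a plain dict stands in for the database):

    import hashlib

    from PIL import Image

    def image_hash(path):
      # hash the raw pixel data; the size and mode are mixed in so that
      # two images whose raw bytes happen to match but whose dimensions
      # or channel layouts differ do not collide
      with Image.open(path) as image:
        digest = hashlib.sha256()
        digest.update(image.mode.encode())
        digest.update(str(image.size).encode())
        digest.update(image.tobytes())
        return digest.hexdigest()

    def build_hash_index(image_paths):
      # computed once up front, then reused for every lookup
      return {image_hash(path): path for path in image_paths}

    index = build_hash_index(possible_duplicates)
    duplicate = index.get(image_hash("input.jpg"))  # None if no match

    Each new input image then costs one hash computation plus a dictionary lookup, rather than a full pass over the entire collection.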

    Update - 2018-08-06

    As per the request of the OP, if you were given the directory of the possible duplicate images and not the explicit image paths themselves, then you could use the os and ntpath modules like so

    import ntpath
    import os
    
def get_all_images(directory):
  image_paths = []

  for filename in os.listdir(directory):
    # to be as careful as possible, you might check that the file
    # is in fact an image, for instance using
    # filename.endswith(".jpg") to pick out .jpg files
    image_paths.append("{}/{}".format(directory, filename))

  return image_paths
    
    def get_filename(path):
      return ntpath.basename(path)
    

    Using these functions, the updated program might look like

    possible_duplicates = get_all_images("/path/to/images")
    duplicate_path = find_duplicate_image("/path/to/input.jpg", possible_duplicates)
    if duplicate_path:
      print(get_filename(duplicate_path))
    

The above will print the name of the duplicate image only if a duplicate was found; otherwise, it will print nothing.
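
    As the comment in get_all_images suggests, you might also filter out non-image files before comparing. A minimal sketch of that variant (the tuple of allowed extensions here is just an example):

    ALLOWED_EXTENSIONS = (".jpg", ".jpeg", ".png")

    def get_all_images(directory):
      return [
          "{}/{}".format(directory, filename)
          for filename in os.listdir(directory)
          if filename.lower().endswith(ALLOWED_EXTENSIONS)
      ]

    str.endswith accepts a tuple of suffixes, so a single check covers all of the allowed extensions.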