
Compare images and remove duplicates


I have two folders with images, they're all PNGs. One folder is a copy of the other with some images changed and some added. The filenames are the same but the image contents may be different. Other attributes like time stamps are completely random, unfortunately.

In the newer folder, I want to remove the duplicates (by content) and keep only the updated and the new images.

I installed ImageMagick to use the compare command but I can't figure it out. :-( Can you help me please? Thanks in advance!

Added: I'm on Mac OS X.


Solution

  • You don't say whether you are on OS X/Linux or Windows, but I can get you started. ImageMagick can calculate a hash (checksum) of all the pixel data in an image, regardless of date or timestamp, like this:

    identify -format "%# %f\n" *.png
    
    25a3591a58550edd2cff65081eab11a86a6a62e006431c8c4393db8d71a1dfe4 blue.png
    304c0994c751e75eac86bedac544f716560be5c359786f7a5c3cd6cb8d2294df green.png
    466f1bac727ac8090ba2a9a13df8bfb6ada3c4eb3349087ce5dc5d14040514b5 grey.png
    042a7ebd78e53a89c0afabfe569a9930c6412577fcf3bcfbce7bafe683e93e8a hue.png
    d819bfdc58ac7c48d154924e445188f0ac5a0536cd989bdf079deca86abb12a0 lightness.png
    b63ad69a056033a300f23c31f9425df6f469e79c2b9f3a5c515db3b52c323a65 montage.png
    a42a5f0abac3bd2f6b4cbfde864342401847a120dacae63294edb45b38edd34e red.png
    10bf63fd725c5e02c56df54f503d0544f14f754d852549098d5babd8d3daeb84 sample.png
    e95042f227d2d7b2b3edd4c7eec05bbf765a09484563c5ff18bc8e8aa32c1a8e sat.png
    

    If you run that in each folder and redirect the output to a file, you will have the checksums of all the images, with their names beside them, in a separate file for each folder.

    If you then merge the two files and sort them you can find duplicates quite easily since the duplicated files will come up next to each other.
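
    The merge-and-sort idea can be sketched as below. This is only an illustration: the two listing files and the short hashes in them are made up, standing in for the real identify output of each folder. Note that uniq -d would also report a hash that occurs twice within a single folder.

    ```shell
    # Made-up sample listings standing in for the identify output of each folder
    tmp=$(mktemp -d)
    printf '%s\n' 'aaa111 blue.png' 'bbb222 green.png' > "$tmp/dira.txt"
    printf '%s\n' 'aaa111 blue.png' 'ccc333 green.png' 'ddd444 new.png' > "$tmp/dirb.txt"

    # Merge both listings, sort them so equal hashes become adjacent,
    # keep only the hash column, and print each hash that occurs more than once
    dups=$(sort "$tmp/dira.txt" "$tmp/dirb.txt" | awk '{ print $1 }' | uniq -d)
    printf '%s\n' "$dups"
    ```

    Here green.png has different hashes in the two listings, so only the hash shared by both copies of blue.png is reported.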

    Let's say, you run the above command in two folders dira and dirb like this

    cd dira
    identify -format "%# %f\n" *.png > "$HOME/dira"
    
    # dirb is assumed to sit next to dira
    cd ../dirb
    identify -format "%# %f\n" *.png > "$HOME/dirb"
    

    Then you could do something like this in awk

    awk 'FNR==NR{name[$1]=$2;next}
                { 
                   if($1 in name){print $2 " duplicates " name[$1]}
                }' $HOME/dir*
    

    The $HOME/dir* part passes both files into awk. The block in {} after FNR==NR runs only for the first file read; as it is read, we fill an associative array name[], indexed by hash, with the filenames. Then, while reading the second file, we check whether each hash has already been seen; if it has, we report a duplicate, printing the name from the second file ($2) followed by the name stored from the first file (name[$1]).
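
    To see why FNR==NR singles out the first file, here is a small sketch using two throwaway files (their names and contents are invented for the demonstration). NR counts lines across all input, while FNR restarts at 1 for each file, so the two are equal only while the first file is being read.

    ```shell
    # Two throwaway input files, made up for this demonstration
    tmp=$(mktemp -d)
    printf 'a\nb\n' > "$tmp/first.txt"
    printf 'c\n'    > "$tmp/second.txt"

    # Print NR, FNR and which branch the FNR==NR test would select for each line
    trace=$(awk '{ print NR, FNR, (FNR == NR ? "first file" : "later file") }' \
        "$tmp/first.txt" "$tmp/second.txt")
    printf '%s\n' "$trace"
    ```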

    This won't work with filenames with spaces in them, so if that is a problem, change the identify command to put a colon between the hash and the filename like this:

    identify -format "%#:%f\n" *.png
    

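    and run awk as awk -F":" so that it splits fields on the colon instead of on whitespace; it should then work again, provided the filenames themselves contain no colons.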
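
    Putting the pieces together, a minimal sketch of the removal step might look like this. The helper name list_duplicates, the sample listings and the dirb/ prefix are assumptions made for illustration. The loop only prints what it would remove, so check the output before swapping the printf for a real rm.

    ```shell
    # list_duplicates OLD_LIST NEW_LIST   (hypothetical helper name)
    # Each list holds "hash:filename" lines, as written by identify -format "%#:%f\n".
    # Prints the filenames from NEW_LIST whose hash also appears in OLD_LIST.
    list_duplicates() {
        awk -F':' 'FNR == NR  { seen[$1]; next }
                   $1 in seen { print substr($0, length($1) + 2) }' "$1" "$2"
    }

    # Made-up sample listings standing in for the real identify output
    tmp=$(mktemp -d)
    printf '%s\n' 'aaa111:my blue.png' 'bbb222:green.png' > "$tmp/old.txt"
    printf '%s\n' 'aaa111:my blue.png' 'ccc333:green.png' 'ddd444:new.png' > "$tmp/new.txt"

    # Dry run: report each duplicate instead of deleting it
    list_duplicates "$tmp/old.txt" "$tmp/new.txt" | while IFS= read -r f; do
        printf 'would remove: %s\n' "dirb/$f"
    done
    ```

    Taking everything after the first colon with substr, rather than printing $2, keeps filenames intact even if they contain spaces or further colons.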