Compare images and remove duplicates

I have two folders of images, all PNGs. One folder is a copy of the other with some images changed and some added. The filenames are the same, but the image contents may differ. Other attributes like timestamps are completely random, unfortunately.

In the newer folder, I want to remove the duplicates (by content) and keep only the updated and the new images.

I installed ImageMagick to use the compare command but I can't figure it out. :-( Can you help me please? Thanks in advance!

Added: I'm on Mac OS X.

Solution

You don't say whether you are on OS X/Linux or Windows, but I can get you started. ImageMagick can calculate a hash (checksum) of the pixel data in an image, independent of any dates or timestamps, like this:

identify -format "%# %f\n" *.png

25a3591a58550edd2cff65081eab11a86a6a62e006431c8c4393db8d71a1dfe4 blue.png
304c0994c751e75eac86bedac544f716560be5c359786f7a5c3cd6cb8d2294df green.png
466f1bac727ac8090ba2a9a13df8bfb6ada3c4eb3349087ce5dc5d14040514b5 grey.png
042a7ebd78e53a89c0afabfe569a9930c6412577fcf3bcfbce7bafe683e93e8a hue.png
d819bfdc58ac7c48d154924e445188f0ac5a0536cd989bdf079deca86abb12a0 lightness.png
b63ad69a056033a300f23c31f9425df6f469e79c2b9f3a5c515db3b52c323a65 montage.png
a42a5f0abac3bd2f6b4cbfde864342401847a120dacae63294edb45b38edd34e red.png
10bf63fd725c5e02c56df54f503d0544f14f754d852549098d5babd8d3daeb84 sample.png
e95042f227d2d7b2b3edd4c7eec05bbf765a09484563c5ff18bc8e8aa32c1a8e sat.png

So, if you do that in each folder and redirect the output to a file, you will have one file per folder listing the checksum of every image with its name beside it.

If you then merge the two files and sort them you can find duplicates quite easily since the duplicated files will come up next to each other.
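The merge-and-sort idea can be sketched with nothing more than sort and awk. This is only an illustration: adjacent_dupes is a hypothetical helper name, and it assumes each list holds "hash filename" lines like the identify output above.

```shell
# adjacent_dupes LIST_A LIST_B   (hypothetical helper, not part of ImageMagick)
# Merge two "hash filename" lists, sort them so equal hashes end up on
# neighbouring lines, and print each line whose hash repeats the hash of
# the line directly before it.
adjacent_dupes() {
    sort "$1" "$2" |
    awk '($1 "") == prev { print }    # same hash as the previous line
         { prev = $1 "" }             # keep the hash as a string'
}
```

Called as adjacent_dupes listA listB, it prints one line for every repeated hash, which is enough to see which images occur more than once.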

Let's say you run the above command in two sibling folders, dira and dirb, saving each list to a file in $HOME, like this:

cd dira
identify -format "%# %f\n" *.png > $HOME/dira

cd ../dirb
identify -format "%# %f\n" *.png > $HOME/dirb

Then you could do something like this in awk:

awk 'FNR==NR { name[$1] = $2; next }    # first file: save each filename under its hash
     $1 in name { print $2 " duplicates " name[$1] }    # second file: hash already seen
    ' $HOME/dir*

So, the $HOME/dir* part passes both files to awk (this assumes nothing else in $HOME has a name starting with dir). The piece in {} after FNR==NR only applies to the first file read in; as it is read, we fill an associative array name[], indexed by the hash, with the filenames. Then, while reading the second file, we check whether each hash has already been seen. If it has, we report a duplicate, printing the name from the second file ($2) and the name saved from the first file (name[$1]).

This won't work with filenames with spaces in them, so if that is a problem, change the identify command to put a colon between the hash and the filename like this:

identify -format "%#:%f\n" *.png

and change the awk to awk -F":" so that it splits the fields on the colon instead of on whitespace, and it should work again.
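To get from a list of duplicates to actually removing them from the newer folder, something along these lines might work. It is a rough sketch, not a tested tool: remove_dupes and its --delete switch are hypothetical names, and it assumes both lists were written in the colon-separated "hash:filename" format shown above. Without --delete it only prints what it would remove, which is worth checking first.

```shell
# remove_dupes OLD_LIST NEW_LIST NEW_DIR [--delete]   (hypothetical helper)
# OLD_LIST and NEW_LIST hold "hash:filename" lines. Every file named in
# NEW_LIST whose hash also appears in OLD_LIST is treated as a duplicate
# and is removed from NEW_DIR, but only if --delete is given.
remove_dupes() {
    old_list=$1
    new_list=$2
    new_dir=$3
    mode=${4:-}

    # First file: remember every hash. Second file: print the filename
    # (everything after the first colon) when its hash was already seen.
    awk -F':' 'FNR == NR { seen[$1] = 1; next }
               $1 in seen { print substr($0, length($1) + 2) }' \
        "$old_list" "$new_list" |
    while IFS= read -r file; do
        if [ "$mode" = "--delete" ]; then
            rm -- "$new_dir/$file"
        else
            echo "would remove: $new_dir/$file"
        fi
    done
}
```

For example, remove_dupes $HOME/dira $HOME/dirb dirb would do a dry run against the newer folder dirb. Note that this matches on the hash alone, so a new image whose pixels are identical to any image in the older folder counts as a duplicate, even under a different filename.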