I'm writing a script to calculate the MD5 sum of an image excluding the EXIF tag.
In order to do this accurately, I need to know where the EXIF tag is located in the file (beginning, middle, end) so that I can exclude it.
How can I determine where in the file the tag is located?
The images that I am scanning are in the formats TIFF, JPG, PNG, BMP, DNG, CR2, and NEF, plus some videos in MOV, AVI, and MPG.
One simple way to do it is to hash only the core image data. For PNG, you could do this by hashing only the "critical chunks" (i.e. the ones whose type name starts with a capital letter, such as IHDR, PLTE, IDAT and IEND). JPEG has a similar but simpler file structure.
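As a rough sketch of the "critical chunks" idea for PNG (Python 3; the function names `iter_png_chunks`, `is_critical` and `hash_critical_chunks` are made up for this illustration): a chunk is critical when bit 5 of the first byte of its type is clear, which is the same thing as the first letter being uppercase. Metadata chunks such as `tEXt` or `eXIf` are ancillary, so they are skipped. The CRCs are not verified here.

```python
import hashlib
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def iter_png_chunks(fh):
    """Yield (chunk_type, data) for each chunk of a PNG opened in binary mode."""
    if fh.read(8) != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    while True:
        header = fh.read(8)
        if len(header) < 8:
            break  # end of file
        # 4-byte unsigned big-endian length, then 4-byte chunk type
        length, chunk_type = struct.unpack(">I4s", header)
        data = fh.read(length)
        fh.read(4)  # skip the CRC (not verified in this sketch)
        yield chunk_type, data
        if chunk_type == b"IEND":
            break


def is_critical(chunk_type):
    # Bit 5 (0x20) of the first byte is clear for critical chunks,
    # i.e. the first letter of the type is uppercase.
    return (chunk_type[0] & 0x20) == 0


def hash_critical_chunks(fh):
    """Return the MD5 hex digest over the type and data of all critical chunks."""
    md5 = hashlib.md5()
    for chunk_type, data in iter_png_chunks(fh):
        if is_critical(chunk_type):
            md5.update(chunk_type)
            md5.update(data)
    return md5.hexdigest()
```

Two files that differ only in ancillary chunks should then give the same digest, e.g. `hash_critical_chunks(open("sample.png", "rb"))`.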
The visual hash in ImageMagick decompresses the image as it hashes it. In your case, you could hash the compressed image data directly, so (if implemented correctly) it should be just as quick as hashing the raw file.
This is a small Python script illustrating the idea. It may or may not work for you, but it should at least give an indication of what I mean :)
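import hashlib
import os
import struct


def png(fh):
    """Hash only the IDAT (compressed pixel data) chunks of a PNG file."""
    md5 = hashlib.md5()
    assert fh.read(8)[1:4] == b"PNG"
    while True:
        try:
            # Chunk length is an unsigned 32-bit big-endian integer
            length, = struct.unpack(">I", fh.read(4))
        except struct.error:
            break  # short read: end of file
        if fh.read(4) == b"IDAT":
            md5.update(fh.read(length))
            fh.read(4)  # CRC
        else:
            # Skip the chunk data and its CRC
            fh.seek(length + 4, os.SEEK_CUR)
    print("Hash: %s" % md5.hexdigest())


def jpeg(fh):
    """Hash a JPEG file from the start-of-scan segment to the end of the file."""
    md5 = hashlib.md5()
    assert fh.read(2) == b"\xff\xd8"  # SOI marker
    while True:
        marker, length = struct.unpack(">2H", fh.read(4))
        assert marker & 0xff00 == 0xff00
        if marker == 0xFFDA:  # Start of scan
            # Hash everything that follows the marker and length field
            md5.update(fh.read())
            break
        else:
            # The length field counts its own 2 bytes but not the marker
            fh.seek(length - 2, os.SEEK_CUR)
    print("Hash: %s" % md5.hexdigest())


if __name__ == '__main__':
    # Binary mode is required; text mode would mangle the data
    with open("sample.png", "rb") as fh:
        png(fh)
    with open("sample.jpg", "rb") as fh:
        jpeg(fh)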
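If you would rather find out where the EXIF data sits and cut it out yourself, the same segment walk can report its position. This is only a sketch for JPEG, written for Python 3; `find_jpeg_exif` is a name invented here, and it assumes the EXIF block is stored the usual way, in an APP1 segment (marker 0xFFE1) whose payload starts with `Exif\0\0`, somewhere before the start of scan. It needs a seekable file opened in binary mode.

```python
import struct


def find_jpeg_exif(fh):
    """Return (offset, size) of the Exif APP1 segment, or None if not found.

    offset is the file position of the 0xFFE1 marker; size covers the
    marker, the length field and the payload.
    """
    if fh.read(2) != b"\xff\xd8":
        raise ValueError("not a JPEG file")
    while True:
        offset = fh.tell()
        header = fh.read(4)
        if len(header) < 4:
            return None  # ran out of file
        marker, length = struct.unpack(">HH", header)
        if marker & 0xFF00 != 0xFF00 or marker == 0xFFDA:
            # Malformed data, or start of scan reached without finding Exif
            return None
        identifier = fh.read(6)
        if marker == 0xFFE1 and identifier == b"Exif\x00\x00":
            # length counts the 2-byte length field but not the 2-byte marker
            return offset, length + 2
        # Jump to the next segment (absolute position)
        fh.seek(offset + 2 + length)
```

With the returned `(offset, size)` you could hash the bytes before `offset` and the bytes from `offset + size` onwards, leaving the EXIF segment out.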