Compute hash of only the core image data (excluding metadata) for an image


I'm writing a script to calculate the MD5 sum of an image excluding the EXIF tag.

In order to do this accurately, I need to know where the EXIF tag is located in the file (beginning, middle, end) so that I can exclude it.

How can I determine where in the file the tag is located?

The images that I am scanning are in the formats TIFF, JPG, PNG, BMP, DNG, CR2, and NEF, plus some videos (MOV, AVI, and MPG).


Solution

  • One simple way to do it is to hash only the core image data. For PNG, you could do this by hashing only the "critical chunks" (i.e. the ones whose type name starts with a capital letter). JPEG has a similar but simpler file structure.

    The visual hash in ImageMagick decompresses the image as it hashes it. In your case, you could hash the compressed image data directly, so (if implemented correctly) it should be just as quick as hashing the raw file.

    This is a small Python script illustrating the idea. It may or may not work for you, but it should at least give an indication of what I mean :)

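    import hashlib
    import os
    import struct
    
    def png(fh):
        md5 = hashlib.md5()
        # The PNG signature is 8 bytes; bytes 1-3 spell "PNG"
        assert fh.read(8)[1:4] == b"PNG"
        while True:
            try:
                # Each chunk starts with a 4-byte big-endian unsigned length
                length, = struct.unpack(">I", fh.read(4))
            except struct.error:
                break  # fewer than 4 bytes left: end of file
            if fh.read(4) == b"IDAT":
                md5.update(fh.read(length))  # compressed pixel data
                fh.read(4)  # CRC
            else:
                fh.seek(length + 4, os.SEEK_CUR)  # skip chunk data and CRC
        print("Hash: %s" % md5.hexdigest())
    
    def jpeg(fh):
        md5 = hashlib.md5()
        assert fh.read(2) == b"\xff\xd8"  # SOI marker
        while True:
            marker, length = struct.unpack(">2H", fh.read(4))
            assert marker & 0xff00 == 0xff00
            if marker == 0xFFDA:  # Start of scan: image data follows
                md5.update(fh.read())
                break
            else:
                # The segment length includes its own two bytes
                fh.seek(length - 2, os.SEEK_CUR)
        print("Hash: %s" % md5.hexdigest())
    
    
    if __name__ == '__main__':
        # Open in binary mode so the bytes are read unmodified
        with open("sample.png", "rb") as fh:
            png(fh)
        with open("sample.jpg", "rb") as fh:
            jpeg(fh)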
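
The "critical chunks" idea for PNG can be taken a step further than hashing IDAT alone. The following is only a rough sketch, assuming Python 3 and a well-formed file: it hashes every chunk whose type starts with an uppercase letter (IHDR, PLTE, IDAT, IEND) and skips ancillary ones such as tEXt. The names `png_critical_md5`, `make_chunk`, and `make_png` are made up for this illustration; the last two only build a tiny in-memory PNG so the example can run without a sample file.

```python
import hashlib
import io
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def png_critical_md5(fh):
    """Return the MD5 hex digest of all critical chunks in a PNG stream."""
    if fh.read(8) != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    md5 = hashlib.md5()
    while True:
        header = fh.read(8)  # 4-byte big-endian length + 4-byte chunk type
        if len(header) < 8:
            break
        length, ctype = struct.unpack(">I4s", header)
        data = fh.read(length)
        fh.read(4)  # skip the CRC
        # Critical chunks have an uppercase first letter (bit 5 clear).
        if not ctype[0] & 0x20:
            md5.update(ctype)
            md5.update(data)
        if ctype == b"IEND":
            break
    return md5.hexdigest()

def make_chunk(ctype, data):
    """Hypothetical demo helper: serialize a single PNG chunk."""
    crc = zlib.crc32(ctype + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + ctype + data + struct.pack(">I", crc)

def make_png(text=None, pixel=b"\x00"):
    """Hypothetical demo helper: a 1x1 grayscale PNG, optionally with a tEXt chunk."""
    chunks = [make_chunk(b"IHDR", struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0))]
    if text is not None:
        chunks.append(make_chunk(b"tEXt", b"Comment\x00" + text))
    # One scanline: filter byte 0 followed by a single 8-bit sample
    chunks.append(make_chunk(b"IDAT", zlib.compress(b"\x00" + pixel)))
    chunks.append(make_chunk(b"IEND", b""))
    return PNG_SIGNATURE + b"".join(chunks)

if __name__ == "__main__":
    plain = png_critical_md5(io.BytesIO(make_png()))
    tagged = png_critical_md5(io.BytesIO(make_png(text=b"shot on holiday")))
    print(plain == tagged)  # prints True: the metadata chunk is ignored
```

Feeding the chunk type into the hash as well as the data keeps, say, a palette and pixel data with identical bytes from being confused. Note that the same pixels saved with different compression settings would still hash differently, since the compressed bytes are hashed as they are.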