
Getting hash (digest) of a file in Python - reading whole file at once vs reading line by line


I need to get a hash (digest) of a file in Python.

Generally, when processing any file content, it is advised to process it gradually, line by line, due to memory concerns; yet I need the whole file to be processed in order to obtain its digest.

Currently I'm obtaining the hash this way:

import hashlib

def get_hash(f_path, mode='md5'):
    h = hashlib.new(mode)
    with open(f_path, 'rb') as file:
        data = file.read()
    h.update(data)
    digest = h.hexdigest()
    return digest

Is there any other way to do this in a more optimized or cleaner manner?

Is there any significant benefit to reading the file gradually, line by line, over reading it all at once, when the whole file still has to be read to calculate the hash?


Solution

  • According to the hashlib documentation, update() can be called repeatedly, and the result is the same as a single call with all of the data concatenated, so you don't need to concern yourself with the block size of a particular hashing algorithm. I wanted to verify that, and it checks out: MD5's internal block size is 512 bits (64 bytes), but whichever chunk size you read with, the resulting digest is identical to reading the whole file in at once.

    import hashlib
    
    def get_hash(f_path, mode='md5'):
        h = hashlib.new(mode)
        with open(f_path, 'rb') as file:
            # Read the entire file into memory in one go.
            data = file.read()
        h.update(data)
        digest = h.hexdigest()
        return digest
    
    def get_hash_memory_optimized(f_path, mode='md5'):
        h = hashlib.new(mode)
        with open(f_path, 'rb') as file:
            # Feed the hash in fixed-size chunks so that only one
            # chunk is held in memory at a time; any chunk size
            # produces the same digest.
            block = file.read(512)
            while block:
                h.update(block)
                block = file.read(512)
    
        return h.hexdigest()
    
    digest = get_hash('large_bin_file')
    print(digest)
    
    digest = get_hash_memory_optimized('large_bin_file')
    print(digest)
    

    > bcf32baa9b05ca3573bf568964f34164
    > bcf32baa9b05ca3573bf568964f34164
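
  • For a cleaner variant on Python 3.11 and later, the standard library offers hashlib.file_digest(), which does the chunked reading for you. A minimal sketch, assuming Python ≥ 3.11 and the same hypothetical 'large_bin_file' as above:

    import hashlib
    
    # file_digest() (Python 3.11+) reads the file object in chunks
    # internally, so the whole file is never held in memory at once.
    with open('large_bin_file', 'rb') as file:
        digest = hashlib.file_digest(file, 'md5')
    
    print(digest.hexdigest())

    On Python 3.8+, the manual read loop can also be tightened with the walrus operator: while block := file.read(512): h.update(block).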