Tags: python, hashlib

Is there a faster way (than this) to calculate the hash of a file (using hashlib) in Python?


My current approach is this:

import hashlib

def get_hash(path=PATH, hash_type='md5'):
    func = getattr(hashlib, hash_type)()
    with open(path, 'rb') as f:
        # read in chunks of 1024 * block_size bytes until the b'' sentinel (EOF)
        for block in iter(lambda: f.read(1024 * func.block_size), b''):
            func.update(block)
    return func.hexdigest()

It takes about 3.5 seconds to calculate the md5sum of an 842 MB ISO file on an i5 @ 1.7 GHz. I have tried different ways of reading the file, but all of them yield slower results. Is there, perhaps, a faster solution?
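For reference, a wall-clock measurement along these lines is enough to compare variants (a sketch; time.perf_counter is an assumption, not necessarily how the 3.5 s figure was taken, and PATH is the same constant used as the default above):

import time

start = time.perf_counter()
digest = get_hash()  # uses the default PATH and 'md5'
print(digest, "in {0:.2f} s".format(time.perf_counter() - start))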

EDIT: I replaced 2**16 (inside the f.read()) with 1024*func.block_size, since the default block_size for most hashing functions supported by hashlib is 64 bytes (except for 'sha384' and 'sha512', whose block_size is 128). The chunk size is therefore still the same (65536 bytes).
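Those block_size values can be checked directly; the following quick check (purely illustrative) prints each algorithm's block size and the resulting chunk size:

import hashlib

for name in ('md5', 'sha1', 'sha256', 'sha384', 'sha512'):
    h = hashlib.new(name)
    # md5/sha1/sha256 use a 64-byte block; sha384/sha512 use 128 bytes
    print(name, h.block_size, 1024 * h.block_size)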

EDIT(2): I did something wrong. It takes 8.4 seconds instead of 3.5. :(

EDIT(3): Apparently Windows was using the disk at over 80% when I ran the function again. It really does take 3.5 seconds. Phew.

Another, slightly faster solution (by about 0.5 s) is to use os.open():

def get_hash(path=PATH, hash_type='md5'):
    func = getattr(hashlib, hash_type)()
    # os.O_BINARY exists only on Windows; it keeps the read in binary mode there
    f = os.open(path, os.O_RDWR | os.O_BINARY)
    for block in iter(lambda: os.read(f, 2048 * func.block_size), b''):
        func.update(block)
    os.close(f)
    return func.hexdigest()

Note that these results are not final.
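Since os.O_BINARY only exists on Windows, a portable variant of the same idea (a sketch, not benchmarked; get_hash_portable is just an illustrative name) opens the file read-only and falls back to 0 where the flag is absent:

import hashlib
import os

def get_hash_portable(path=PATH, hash_type='md5'):
    func = getattr(hashlib, hash_type)()
    # os.O_BINARY is Windows-only; POSIX opens are already "binary", so use 0 there
    flags = os.O_RDONLY | getattr(os, 'O_BINARY', 0)
    fd = os.open(path, flags)
    try:
        for block in iter(lambda: os.read(fd, 2048 * func.block_size), b''):
            func.update(block)
    finally:
        os.close(fd)
    return func.hexdigest()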


Solution

  • Using an 874 MiB random-data file, which required 2 seconds with the openssl md5 tool, I was able to improve speed as follows.

    • Using your first method required 21 seconds.
    • Reading the entire file (21 seconds) to buffer and then updating required 2 seconds.
    • Using the following function with a buffer size of 8096 required 17 seconds.
    • Using the following function with a buffer size of 32767 required 11 seconds.
    • Using the following function with a buffer size of 65536 required 8 seconds.
    • Using the following function with a buffer size of 131072 required 8 seconds.
    • Using the following function with a buffer size of 1048576 required 12 seconds.
    import hashlib
    import time

    def md5_speedcheck(path, size):
        pts = time.process_time()
        ats = time.time()
        m = hashlib.md5()
        with open(path, 'rb') as f:
            # hash the file in chunks of the given buffer size
            b = f.read(size)
            while len(b) > 0:
                m.update(b)
                b = f.read(size)
        print("Processor time: {0:.3f} s".format(time.process_time() - pts))
        print("Wall time: {0:.3f} s".format(time.time() - ats))
    

    The wall-clock ("human") times are what I noted above; processor time is about the same for all of these, with the difference being spent blocked on I/O.

    The key determinant here is to have a buffer size that is big enough to mitigate disk latency, but small enough to avoid VM page swaps. For my particular machine it appears that 64 KiB is about optimal.
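
    A small sweep built on md5_speedcheck above can locate that sweet spot on a given machine (a sketch; the path is just a placeholder for whatever large file you want to test with):

    # probe buffer sizes from 8 KiB up to 1 MiB
    for size in (8192, 32768, 65536, 131072, 1048576):
        print("buffer size: {0} bytes".format(size))
        md5_speedcheck("/path/to/large.file", size)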