python hash checksum

Calculate and add a hash to a file in Python


One way of checking whether a file has been modified is to calculate and store a hash (or checksum) of the file. At any later point the hash can be recalculated and compared against the stored value.
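
A minimal sketch of that store-and-compare workflow, using only the standard library's `hashlib`. The helper names `file_sha256` and `is_unmodified` are hypothetical, and SHA-256 is an assumed choice of algorithm; the hash is kept outside the file here.

```python
import hashlib


def file_sha256(path, chunk_size=65536):
    """Return the hex SHA-256 digest of the file at `path` (hypothetical helper)."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unmodified(path, stored_digest):
    """Recompute the digest and compare it with the previously stored value."""
    return file_sha256(path) == stored_digest
```

The digest returned by `file_sha256` would be saved somewhere (a database, a sidecar file) and passed back to `is_unmodified` later.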

I'm wondering if there is a way to store the hash of a file in the file itself? I'm thinking text files.

The algorithm would need to be iterative and account for the fact that the hash is added to the very file it is being calculated for. Does that make sense? Is anything like this available?

Thanks!

edit: https://security.stackexchange.com/questions/3851/can-a-file-contain-its-md5sum-inside-it


Solution

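  • from Crypto.Hash import HMAC
    secret_key = b"Don't tell anyone"  # keys and data must be bytes on Python 3
    h = HMAC.new(secret_key)
    text = b"whatever you want in the file"
    ## or: text = open("your_file_without_hash_yet", "rb").read()
    h.update(text)
    # Open for writing in binary mode; the hex digest goes right after the text
    with open("file_with_hash", "wb") as fh:
        fh.write(text)
        fh.write(h.hexdigest().encode("ascii"))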
    

    Now, as some people tried to point out (though they seemed confused), you need to remember that this file has the hash appended to the end, and that the hash itself is not part of what gets hashed. So when you want to check the file, you would do something along these lines:

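    h = HMAC.new(secret_key)
    end_len = len(h.hexdigest())  # length of the hex digest appended to the file
    all_text = open("file_with_hash", "rb").read()
    # The digest is the last end_len bytes; everything before it is the text
    text, expected_hmac = all_text[:-end_len], all_text[-end_len:]
    h.update(text)
    if h.hexdigest().encode("ascii") != expected_hmac:
        raise ValueError("Somebody messed with your file!")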
    

    It should be clear, though, that this alone doesn't ensure your file hasn't been changed; the typical use case is to encrypt your file but take the hash of the plaintext. That way, if someone changes the hash (at the end of the file) or changes any of the characters in the message (the encrypted portion), the values will no longer match and you will know something was changed.

    A malicious actor won't be able to change the file AND fix the hash to match, because after changing the data they would have to recompute the HMAC with your secret key. So long as no one else knows your secret key, they can't recreate the correct hash.
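
The append-and-verify idea can also be sketched with the standard library alone. This is not the `Crypto.Hash` code above: it swaps in Python's built-in `hmac` module with SHA-256 (an assumed choice), and the function names `append_hmac` and `verify_hmac` are hypothetical. It works on bytes in memory; reading and writing the file is left out.

```python
import hashlib
import hmac

# A hex-encoded SHA-256 tag is twice the digest size in characters.
DIGEST_HEX_LEN = hashlib.sha256().digest_size * 2


def append_hmac(text, key):
    """Return `text` with its hex HMAC-SHA256 tag appended (both are bytes)."""
    tag = hmac.new(key, text, hashlib.sha256).hexdigest().encode("ascii")
    return text + tag


def verify_hmac(data, key):
    """Split off the trailing tag, recompute it, and return the text if it matches."""
    if len(data) < DIGEST_HEX_LEN:
        raise ValueError("data is too short to contain a tag")
    text, tag = data[:-DIGEST_HEX_LEN], data[-DIGEST_HEX_LEN:]
    expected = hmac.new(key, text, hashlib.sha256).hexdigest().encode("ascii")
    # compare_digest takes constant time, so it does not leak where the tags differ.
    if not hmac.compare_digest(expected, tag):
        raise ValueError("HMAC mismatch: contents or tag were changed")
    return text
```

Changing any byte of the signed data, or verifying with a different key, makes `verify_hmac` raise `ValueError`, which is the behaviour the paragraph above relies on.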