python encoding sha256 hashlib

Python hashing with hashlib throwing error even though data is encoded. How to fix?


I have a function that calculates the hash of all files in a directory. As part of this, each file is opened, chunks are read, and the hash is updated:

import hashlib, os

def get_dir_hash(directory, verbose=0):
    hash = hashlib.sha256()
    if not os.path.exists(directory):
        return -1

    try:
        for root, dirs, files in os.walk(directory):
            for names in files:
                if verbose == 1:
                    print(f"Hashing {names}")

                filepath = os.path.join(root, names)
                try:
                    f1 = open(filepath, 'rb')
                except:
                    # You can't open the file for some reason
                    if f1 is not None:
                        f1.close()
                    continue

                while 1:
                    # Read file in as little chunks
                    buf = f1.read(4096)
                    if not buf:
                        break
                    hash.update(hashlib.sha256(str(buf).encode('utf-8')).hexdigest())

                if f1 is not None:
                    f1.close()

    except:
        import traceback
        # Print the stack traceback
        traceback.print_exc()
        return -2

    return hash.hexdigest()

Note that I read a chunk of bytes, convert it to a string, and encode it as UTF-8, as suggested by other answers here on SO:

hash.update(hashlib.sha256(str(buf).encode('utf-8')).hexdigest())

However, I still get this error:

Traceback (most recent call last):
  File "/home/user/Work/mmr6/mmr/util/dir_hash.py", line 33, in get_dir_hash
    hash.update(hashlib.sha256(str(buf).encode('utf-8')).hexdigest())
TypeError: Unicode-objects must be encoded before hashing

What am I missing?


Solution

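  • Here is what you were missing:
    In hash.update(hashlib.sha256(str(buf).encode('utf-8')).hexdigest()), the str(buf).encode('utf-8') part is unnecessary: buf is already a bytes object, so you can pass it to hashlib.sha256() directly. It is also subtly wrong, because str(buf) does not decode the data; it produces the repr text "b'...'", so you would be hashing that text rather than the file contents.
    The error itself comes from the outer call: hexdigest() returns a str, and hash.update() only accepts bytes-like objects.
    The fixed version of the line would be: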

    hash.update(hashlib.sha256(buf).hexdigest().encode("utf-8"))
    

    I'm not 100% sure that this is what you intended, so feel free to let me know.
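
If the goal is simply a single digest over the contents of every file, one possible alternative is to skip the inner sha256 and feed each chunk straight into the outer hash, since buf is already bytes. The following is a minimal sketch under that assumption, not necessarily what the original function was meant to do. The chunk_size parameter, the sorting of directory and file names (added so the result does not depend on traversal order), and the choice to skip unreadable files are all assumptions made for illustration.

```python
import hashlib
import os


def get_dir_hash(directory, chunk_size=4096):
    """Return one SHA-256 hex digest over the contents of all files under directory."""
    digest = hashlib.sha256()
    for root, dirs, files in os.walk(directory):
        # Assumption: sort in place so os.walk visits subdirectories in a
        # stable order and the digest is reproducible across runs.
        dirs.sort()
        for name in sorted(files):
            filepath = os.path.join(root, name)
            try:
                # "with" closes the file even if reading fails part-way.
                with open(filepath, "rb") as f:
                    # iter() with a sentinel calls f.read() until it returns b"".
                    for buf in iter(lambda: f.read(chunk_size), b""):
                        # buf is bytes, so it can be passed to update() as-is.
                        digest.update(buf)
            except OSError:
                # Assumption: files that cannot be opened or read are skipped.
                continue
    return digest.hexdigest()
```

Note that this hashes file contents only; file names and directory structure do not affect the result. Calling get_dir_hash("some_dir") returns a 64-character hex string.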