
Is hashing small chunks of a file in 3 places a good verification?


I move a lot of large video files around networks, and I've adopted a method of comparing files after copying to ensure they are the same. The method is very fast, even over a network. Trouble is, I'm not sure it's adequate. Essentially, I hash a small chunk from the start, middle, and end of each file, concatenate the hashes, and compare them. I would love some feedback or advice about the validity of this method. For example, what if one of the two files contained an edit which left the file size the same and also fell outside the scope of the three hashed chunks? Could the method lie about the equality of the files in such a case?

Here's what I do (in Python):

import os
from hashlib import sha256

def get_hash(filename):
    chunk_size = 1024
    size = os.stat(filename).st_size

    # Files too small to sample three distinct chunks get hashed in full.
    if size < chunk_size * 3:
        return get_full_hash(filename)

    hash_string = ''
    with open(filename, 'rb') as f:
        # Chunk from the start of the file.
        hash_string += sha256(f.read(chunk_size)).hexdigest()
        # Chunk from the middle.
        f.seek(size // 2)
        hash_string += sha256(f.read(chunk_size)).hexdigest()
        # Chunk from the end.
        f.seek(size - chunk_size)
        hash_string += sha256(f.read(chunk_size)).hexdigest()

    return hash_string
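
The get_full_hash fallback isn't shown in the question. Here is a minimal sketch of what such a helper might look like, assuming it is simply meant to hash the entire file with sha256 in fixed-size chunks (the 1 MiB chunk size is an illustrative choice, not part of the original code):

from hashlib import sha256

def get_full_hash(filename, chunk_size=1024 * 1024):
    # Hash the whole file in chunks so large files never have to fit in memory.
    file_hash = sha256()
    with open(filename, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            file_hash.update(chunk)
    return file_hash.hexdigest()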

Solution

  • Could the method lie about the equality of the files in such a case?

    Yes, of course it could.

    Even a hash of the entire file could "lie", although with something like sha256 the probability of a collision is astronomically small. The only way to be certain is to compare the entire contents of both files (a sketch of such a check follows this answer).

    How much slower is hashing the whole file?

    Is your method "good enough"? Possibly. If you were looking for duplicate video files, this might be a decent option. If you're concerned about a few bytes getting corrupted during a large copy, this probably isn't a very good option.

    Ultimately, it's a matter of how much risk you're willing to take.
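
    Below is a minimal sketch of that full-content check, reading both files in chunks so nothing large sits in memory; the helper name files_identical and the 1 MiB chunk size are illustrative choices, not part of the original code:

    import os

    def files_identical(path_a, path_b, chunk_size=1024 * 1024):
        # A size mismatch settles it immediately without reading any data.
        if os.stat(path_a).st_size != os.stat(path_b).st_size:
            return False
        with open(path_a, 'rb') as a, open(path_b, 'rb') as b:
            while True:
                chunk_a = a.read(chunk_size)
                chunk_b = b.read(chunk_size)
                if chunk_a != chunk_b:
                    return False
                if not chunk_a:  # both files exhausted at the same point
                    return True

    The standard library's filecmp.cmp(path_a, path_b, shallow=False) performs essentially the same chunked byte-by-byte comparison if you'd rather not write it yourself.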