
Why does my SHA-256 checksum differ from the checksum returned by AWS Glacier?


I have an archive file on an Ubuntu server. I uploaded it to AWS Glacier using the AWS CLI. When the upload finished, AWS returned a checksum like this:

{"checksum": "6c126443c882b8b0be912c91617a5765050d7c99dc43b9d30e47c42635ab02d5"}

But when I computed the checksum on my own server like this:

sunny@server:~$ sha256sum backup.zip

it returned this checksum:

5ba29292a350c4a8f194c78dd0ef537ec21ca075f1fe649ae6296c7100b25ba8

Why do the two checksums differ?


Solution

  • While the checksum returned by Glacier uses SHA-256, it is not a plain SHA-256 digest of the entire file. Instead, Glacier computes a tree hash: it hashes each 1 MiB (1,048,576-byte) chunk of the data, then hashes the concatenation of each adjacent pair of hashes, and repeats the process until a single hash remains. If a level has an odd number of hashes, the last one is carried up to the next level unchanged. For more information, see the documentation.

    Here is a simple implementation in Python:

    #!/usr/bin/env python3
    import hashlib
    import sys
    import binascii
    
    # Given a file object (opened in binary mode), calculate the checksum used by glacier
    def calc_hash_tree(fileobj):
        chunk_size = 1048576
    
        # Calculate a list of hashes for each chunk in the fileobj
        chunks = []
        while True:
            chunk = fileobj.read(chunk_size)
            if len(chunk) == 0:
                break
            chunks.append(hashlib.sha256(chunk).digest())
        
        # Now calculate each level of the tree till one digest remains
        while len(chunks) > 1:
            next_chunks = []
            while len(chunks) > 1:
                next_chunks.append(hashlib.sha256(chunks.pop(0) + chunks.pop(0)).digest())
            if len(chunks) > 0:
                next_chunks.append(chunks.pop(0))
            chunks = next_chunks
    
        # The final remaining hash is the root of the tree:
        return binascii.hexlify(chunks[0]).decode("utf-8")
    
    if __name__ == "__main__":
        with open(sys.argv[1], "rb") as f:
            print(calc_hash_tree(f))
    

    You can call it on a single file like this:

    $ ./glacier_checksum.py backup.zip
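
To see why the two values disagree, here is a minimal sketch that compares the tree hash with a plain SHA-256 digest on in-memory data. The `tree_hash` helper and the test inputs are made up for illustration (they are not part of any AWS API), and the sketch assumes non-empty input. For data that fits in a single 1 MiB chunk the two hashes should coincide, since the only leaf is also the root; for anything larger they diverge, which is the situation with `backup.zip`.

```python
import hashlib

MIB = 1024 * 1024  # chunk size used by the tree hash described above


def tree_hash(data: bytes) -> str:
    """Tree hash of a non-empty byte string (illustrative helper, not an AWS API)."""
    # Leaf level: one SHA-256 digest per 1 MiB chunk.
    level = [hashlib.sha256(data[i:i + MIB]).digest()
             for i in range(0, len(data), MIB)]
    # Hash adjacent pairs until one digest remains;
    # an odd leftover digest is carried up unchanged.
    while len(level) > 1:
        pairs = [level[i:i + 2] for i in range(0, len(level), 2)]
        level = [hashlib.sha256(b"".join(p)).digest() if len(p) == 2 else p[0]
                 for p in pairs]
    return level[0].hex()


small = b"a" * 1000           # fits in one chunk
large = b"a" * (2 * MIB + 5)  # spans three chunks

# Single chunk: the tree hash is just the SHA-256 of the data.
print(tree_hash(small) == hashlib.sha256(small).hexdigest())
# Multiple chunks: the tree hash no longer matches sha256sum-style output.
print(tree_hash(large) == hashlib.sha256(large).hexdigest())
```

So the checksum from `sha256sum` can only match Glacier's value for files of at most 1 MiB; for larger archives, compare against the tree hash instead.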