Search code examples
pythonalgorithmhashamazon-web-serviceshashtree

Reverse tree building (with an odd number of children)


I just found out about the AWS Glacier service and wanted to write a small Python application to upload files via the REST API. I took a look at the required headers and stumbled upon x-amz-sha256-tree-hash. I need to calculate the SHA-256 hash of the entire file as well as the hash of the parent of all hashes of each 1 MB chunk. This leads to the following tree:

AWS's SHA-256 Tree Hash procedure

(Image taken from here)

I already made a function that reads 1 MB chunks and a class which calculates their hashes on-the-fly but then I completely struggle:

In my application I made a class called chunk which takes data and calculates a hash in the __init__ method as well as holds parent and children (like a regular tree). When the user opens a file those chunks instances will be generated properly with their respective hashes (in this example that would be 7 chunk instances).

Now I have two big problems that are connected to each other:

  1. How do I build this tree in reverse? I basically need to make a new chunk for each two chunk instances on the lowest layer and calculate a hash based on those two hashes. But where do I store that parent? In the children of the parent and do reverse tree walking?
  2. How does that work with an odd number of children? If I have an algorithm that goes through each parent layer then I would miss the last (0.5 MB) chunk.

I checked out this topic on SO but that method only works with an even children count which is not always given.

Can you help me finding a way/an algorithm/an approach on how to solve this issue?

Thanks in advance!

Paul


Solution

  • First calculate the number of levels, then

    def proclevel(levels):
        if levels > 0:
            generator = proclevel(levels - 1)
            temp = None
            for firsthash, secondhash in generator:
                if not temp: temp = hashofthem(firsthash, secondhash)
                else: yield temp, hashofthem(firsthash, secondhash); temp = None
            #If odd number of packets
            if temp: yield temp, None
        else:
            temp = None
            for chunk in chunks:
                if not temp: temp = hash(chunk)
                else: yield temp, hash(chunk); temp = None
            if temp: yield temp, None
    

    Make sure to handle None as second argument in hashofthem :)