Search code examples
algorithmdata-structureshashmapmerkle-tree

What is the time complexity of comparing two Merkle trees?


I have a simple recursive function that compares two merkle trees and accumulates the differences in the leaf nodes. However, I am unable to measure the time complexity of it. Specifically, I would like to see how does it compare against comparing two Hashtables or two BSTs. A group of leaves is a row in this case and they share a rowid. A group of rows form the entire merkle tree. In the code below I am just accumulating the differences at the leaf level.

def diff_helper(node1: MNode, node2: MNode, diff: List[Difference]):
    if not node1 and not node2:
        return
    elif node1.rowid==node2.rowid and node1.signature==node2.signature and node1.nodetype==NodeType.Row and node2.nodetype==NodeType.Row:
        return
    elif node1.rowid==node2.rowid and node1.signature!=node2.signature and node1.nodetype==NodeType.Row and node2.nodetype==NodeType.Row:
        diff_helper(node1.left, node2.left, diff)
        diff_helper(node1.right, node2.right, diff)
    elif node1.rowid==node2.rowid and node1.signature!=node2.signature and node1.nodetype==NodeType.Leaf and node2.nodetype==NodeType.Leaf:
        diff.append(Difference(node1.rowid, node1.column, node1.value, node2.value))
    else:
        diff_helper(node1.left, node2.left, diff)
        diff_helper(node1.right, node2.right, diff)

Time complexity:

On the best case, I see that this is a constant operation since the root hashes of both trees would be the same. On the worst case, the number of comparisons is the total number of all leaf nodes.

Question:

I can sense that merkle trees does better than plain hashtables because of the ability to prune the trees much faster. However, I am unable to represent this in Big O terms.

A comparable hashtable implementation would be to do a traversal of the rowids and constant time lookup on the second hashtable. Once you find the values of the rowid, you'll probably do a linear comparison of each leaf if the leaf level data is stored as a hashtable.


Solution

  • This is a good example of an output-sensitive algorithm, whose worst-case running time is most accurately specified by introducing a new quantity. For this problem, if h is the height of the trees and d is the number of leafs diffs, then the worst case is O(d + d (h − log d)). At level ℓ of the tree we can have up to min(2, d) nodes with diffs, and we can charge the cost of matching nodes to their parents, so we just have to bound this sum. The crossover point is ℓ = log d, so ∑ℓ=0...log d 2 is 2log d + 1 − 1 = Θ(d). Then there are h − log d levels remaining with d diffs.