Search code examples
algorithmrecursionheapclrs

Recursive relation for the MAX_HEAPIFY algorithm and the worst case


I came upon the recursive relation for the max-heapify algorithm when going through CLRS. My teacher had justified, quite trivially in fact, that the time complexity of the max-heapify process was O(logn), simply because the worst case is when the root has to 'bubble/float down' from the top all the way to the last level. This means we travel layer by layer, and hence the number of steps equals the number of levels/height of the heap, which, as we know, is bounded by logn. Fair enough.

The same however was proven in CLRS in a more rigorous manner via a recurrence relation. The worst case was said to occur when the last level is half filled and this has already been explained here. So as far as I understand from that answer, they arrived at this conclusion mathematically: we want to maximise the size of the left subtree relative to the heap size n i.e to maximise the value of L/n. And to achieve this we have to have the last level half filled so that the number of nodes in L (left subtree) is maximized and L/n is maximized.

Adding any more nodes to the last level will increase the number of nodes but bring no change to the value of L. So L/n will decrease, as heap becomes more balanced. All is fine and good as long as it's mathematical.

Now this is where I get stuck: Let's say I add one more node in this half-filled level. Practically, I fail to see how this somehow reduces the number of steps/comparisons that occur and is no longer the worst case. Even though I have added one more node, all the comparisons occur only in the left subtree and have nothing to do with the right subtree. Can someone convince me/help me realise why and how exactly does it work out that L/n has to be maximized for the worst case? I would appreciate an example input and how adding more nodes no longer makes it the worst possible case?


Solution

  • Let's say I add one more node in this half-filled level. Practically, I fail to see how this somehow reduces the number of steps/comparisons that occur and is no longer the worst case . Even though I have added one more node, all the comparisons occur only on the left subtree and has nothing to do with the right subtree.

    It is correct that this does not reduce the number of steps. However, when we speak of time complexity, we look for a relation between the number of steps and š‘›. If were to only look at the number of steps, we would only conclude that the worst case happens when the tree is infinitely large. Although that is a really bad case (the worst), that is not what the book means with "worst case" here. It is not just the number of steps that interests us, it is how that number relates to š‘›.

    We can argue about the terminology here, because usually "worst case" is not about something that depends on š‘›, but about variations that can exist for a given š‘›. For instance, when discussing worst case scenarios for a sorting algorithm, the worst and best cases are dependent on how the input data is organised (already sorted, reversed, ...etc). Here "worst case" is used for the shape of the (bottom layer of the) tree, which is directly inferred by the value of š‘›. Once you have š‘›, there is no variation possible there.

    However, for the recurrence relation, we must find the formula -- in terms of š‘› -- that gives an upper limit for the number of children in the left subtree, with the constraint that we want this formula to use simple arithmetic (for example: no flooring).

    Here is a graph where the blue bars represent the value of š‘›, and the orange bars represent the number of nodes in the left subtree.

    enter image description here

    The recurrence relation is based on the idea that the greatest subtree of both subtrees is the left subtree, so it represents the worst case. That subtree has a number of nodes that is somewhere between (š‘›-1)/2 and 2š‘›/3. The ratio between the number of nodes in the left subtree and the total number of nodes is maximised when the left subtree is full, and the right subtree has a lesser height.

    Here is the same data represented as a ratio:

    enter image description here

    You can see where these maxima occur: when š‘› is 2, 5, 11, 23, ... the ratio between the number of nodes in the left subtree and š‘› approaches 40%. This 40% represents the upper limit for the ratio, and is a safe "catch-all" for all values of š‘›.

    We need this ratio in the recurrence relation: that 40% can be rephrased: the number of nodes in the subtree has an upper bound of 2š‘›/3. And so the recurrence relation is

    Ā  Ā  Ā  Ā  Ā  Ā  Ā  š‘‡(š‘›) = š‘‡(2š‘›/3) + O(1)