Search code examples
algorithmheapmedian

Complexity of finding the median using 2 heaps


A way of finding the median of a given set of n numbers is to distribute them among 2 heaps. 1 is a max-heap containing the lower n/2 (ceil(n/2)) numbers and a min-heap containing the rest. If maintained in this way the median is the max of the first heap (along with the min of the second heap if n is even). Here's my c++ code that does this:

priority_queue<int, vector<int> > left;
priority_queue<int,vector<int>, greater<int> > right;
cin>>n; //n= number of items
for (int i=0;i<n;i++) {
    cin>>a;
    if (left.empty())
        left.push(a);
    else if (left.size()<=right.size()) {
            if (a<=right.top())
                left.push(a);
            else {
                left.push(right.top());
                right.pop();
                right.push(a);
            }
    }
    else {
        if (a>=left.top())
            right.push(a);
        else {
            right.push(left.top());
            left.pop();
            left.push(a);
        }
    }
}

We know that the heapify operation has linear complexity . Does this mean that if we insert numbers one by one into the two heaps as in the above code, we are finding the median in linear time?


Solution

  • Linear time heapify is for the cost of building a heap from an unsorted array as a batch operation, not for building a heap by inserting values one at a time.

    Consider a min heap where you are inserting a stream of values in increasing order. The value at the top of the heap is the smallest, so each value trickles all the way down to the bottom of the heap. Consider just the last half of the values inserted. At this time the heap will have very nearly its full height, which is log(n), so each value trickles down log(n) slots, and the cost of inserting n/2 values is O(n log(n))

    If I present a stream of values in increasing order to your median finding algorithm one of the things it has to do is build a min heap from a stream of values in increasing order so the cost of the median finding is O(n log(n)). In, fact the max heap is going to be doing a lot of deletes as well as insertions, but this is just a constant factor on top so I think the overall complexity is still O(n log(n))