Tags: python, heapsort, binary-heap

Why is this implementation of binary heap slower than that of Python's stdlib?


I have been implementing my own heap module to help me understand the heap data structure. I understand how heaps work and are managed, but my implementation is significantly slower than the standard Python heapq module when performing a heap sort: for a list of size 100,000, heapq takes 0.6s while my code takes 2s (it was originally 2.6s; I cut it down to 2s by taking the len() calls out of percDown and passing the length in, so it doesn't have to be recalculated every time the method calls itself). Here is my implementation:

def percDown(lst, start, end, node):
    # Moves the given node down the heap, starting at index start, until the heap
    # property is satisfied (no child may be smaller than its parent).
    iChild = 2 * start + 1
    i = start
    # If the node has reached the bottom of the heap (i.e. it has no children left),
    # return its index (we are done).
    if iChild > end - 1:
        return start
    # If the second child exists and is smaller than the first child, use that
    # child's index for the comparison below.
    if iChild + 1 < end and lst[iChild + 1] < lst[iChild]:
        iChild += 1
    # If the smallest child is less than the node, that child becomes the new parent.
    if lst[iChild] < node:
        # Move the child up into the parent position...
        lst[start] = lst[iChild]
        # ...and keep recursing through the children of the new parent
        # to find where the node is meant to go.
        i = percDown(lst, iChild, end, node)
    return i

popMin: pops the minimum value (lst[0]) and reorders the heap

def popMin(lst):
    # Removes and returns the smallest element (lst[0]), then restores the heap
    # property by sifting the former last element down from the root.
    length = len(lst)
    if length > 1:
        minVal = lst[0]
        ele = lst.pop()
        i = percDown(lst, 0, length - 1, ele)
        lst[i] = ele
        return minVal
    else:
        return lst.pop()

heapify: turns a list into a heap in-place

import math

def heapify(lst):
    # Builds a min-heap in place by sifting down every parent node,
    # starting from the last parent and working back to the root.
    iLastParent = math.floor((len(lst) - 1) / 2)
    length = len(lst)
    while iLastParent >= 0:
        ele = lst[iLastParent]
        i = percDown(lst, iLastParent, length, ele)
        lst[i] = ele
        iLastParent -= 1

sort: heapsorts the given list using the methods above (not in-place)

def sort(lst):
    # Heapifies lst and then repeatedly pops the minimum.
    # Returns a new sorted list; the input list is emptied in the process.
    result = []
    heapify(lst)
    length = len(lst)
    for z in range(0, length):
        result.append(popMin(lst))
    return result
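
A minimal harness along these lines reproduces the kind of comparison described above (a sketch only: it assumes the functions are saved in a module named heap.py, and the random input and single timed run are just illustrative):

# bench.py -- rough comparison of heap.sort against a heapq-based heapsort
import heapq
import random
import time

import heap  # the module shown above, assumed to be saved as heap.py

def heapq_sort(lst):
    # Reference heapsort built on the stdlib heapq module (also not in-place).
    heapq.heapify(lst)
    return [heapq.heappop(lst) for _ in range(len(lst))]

n = 100_000
data = [random.random() for _ in range(n)]

for name, func in [("heapq", heapq_sort), ("heap (mine)", heap.sort)]:
    lst = list(data)  # give each sorter its own copy
    t0 = time.perf_counter()
    result = func(lst)
    elapsed = time.perf_counter() - t0
    assert result == sorted(data)
    print(f"{name:12s} {elapsed:.3f}s")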

Did I mistakenly add complexity to the algorithm/heap creation, or is it just the Python heapq module being heavily optimized? I have a feeling it is the former, as 0.6s vs. 2s is a huge difference.


Solution

  • 0.6s vs. 2.6s is a bit more than a 4x difference. Is that "too big"?

    That's not enough information to answer. A 4x difference certainly could be caused by getting the algorithm wrong… but there's really no way to tell without testing at different sizes.

    If you get, say, only a 1.2x difference at 1000, a 4x difference at 100000, and a 12x difference at 1000000, then your algorithmic complexity is most likely worse, which means you probably did get something wrong, and that's something you need to fix. (A rough harness for this kind of scaling check is sketched at the end of this answer.)

    On the other hand, if it's about a 4x difference at all three sizes, then there's just a bigger constant multiplier in your overhead. Most likely that's because you've got an inner loop running in Python, while the (CPython) stdlib version is using the _heapq accelerator module that does that same loop in C, as explained in Martijn Pieters' answer. (The second sketch at the end of this answer shows one way to time heapq without the accelerator.) So, you didn't get anything wrong. You can probably micro-optimize a bit, but ultimately you're going to have to either rewrite the core of your code in C or run it in a JIT-optimized interpreter to get anywhere near as good as the stdlib. And really, if you're just writing this to understand the algorithm, you don't need to do that.

    As a side note, you might want to try running the comparison in PyPy. Most of its stdlib is written in pure Python, with no special optimizations, but the optimizing JIT compiler makes it almost as fast as the native C code in CPython. And that same JIT gets applied to your code, meaning your unoptimized code is often nearly as fast as the native C code in CPython too. Of course there's no guarantee of that, and it doesn't change the fact that you always need to test at different sizes if you're trying to test for algorithmic complexity.
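
To make the scaling check concrete, here is a minimal sketch of that kind of measurement (it assumes the question's code is saved as heap.py; the sizes and the single timed run per size are only illustrative, and timeit with repeats would be more robust). If the ratio in the last column stays roughly constant across sizes, the complexity matches the stdlib; if it keeps growing with n, something is algorithmically wrong:

# scaling_check.py -- time both implementations at several sizes and compare ratios
import heapq
import random
import time

import heap  # the question's module, assumed to be saved as heap.py

def heapq_sort(lst):
    heapq.heapify(lst)
    return [heapq.heappop(lst) for _ in range(len(lst))]

def time_once(func, data):
    lst = list(data)  # fresh copy for each run
    t0 = time.perf_counter()
    func(lst)
    return time.perf_counter() - t0

for n in (1_000, 100_000, 1_000_000):  # the largest size can take a while in pure Python
    data = [random.random() for _ in range(n)]
    t_stdlib = time_once(heapq_sort, data)
    t_mine = time_once(heap.sort, data)
    print(f"n={n:>9,}  stdlib={t_stdlib:.3f}s  mine={t_mine:.3f}s  ratio={t_mine / t_stdlib:.1f}x")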
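
And to get a feel for how much of the gap is just the _heapq accelerator, one rough trick is to block _heapq so that heapq falls back to its pure-Python definitions. This relies on an implementation detail of current CPython (heapq.py tries "from _heapq import *" and silently continues on ImportError), so treat it as a diagnostic hack rather than a supported API:

# pure_python_heapq.py -- time the stdlib heapsort without the C accelerator
import sys
sys.modules["_heapq"] = None  # makes "from _heapq import *" fail inside heapq.py;
                              # must run before heapq is imported anywhere in the process
import heapq                  # heapq now uses its pure-Python implementations

import random
import time

n = 100_000
data = [random.random() for _ in range(n)]

t0 = time.perf_counter()
heapq.heapify(data)
result = [heapq.heappop(data) for _ in range(len(data))]
print(f"pure-Python heapq heapsort: {time.perf_counter() - t0:.3f}s")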