python arrays list sorting performance-testing

Efficient solution for merging 2 sorted lists in Python

I am teaching my self Python by starting with the crash course published by Google. One of the practice problems is to write a function that takes 2 sorted lists, merges them together, and returns a sorted list. The most obvious solution is:

def linear_merge(list1, list2):
  list = list1 + list2
  list.sort()
  return list

Obviously above is not very efficient, or so I thought, because on the backend the sort function will have to run over the entire output list again. The problem asks for an efficient way of implementing this function, presumably that it can work well on huge lists. My code was similar to Google's answer, but I tweaked it a bit to make it a bit faster:

def linear_merge_goog(list1, list2):
  result = []
  while len(list1) and len(list2):
    if list1[-1] > list2[-1]:
      result.append(list1.pop())
    else:
      result.append(list2.pop())

  result.extend(list1)
  result.extend(list2)
  return result[::-1]

Original Google code was poping from the front of the array, but even they make a note that it's much more efficient to pop form the back of the array and than reverse it.

I tried to run both functions with large 20 million entry arrays, and the simple stupid combine and sort function comes up on top by a margin of 3X+ every time. Sub 1 second vs. over 3 seconds for what should be the more efficient method.

Any ideas? Am I missing something. Does it have to do with built in sort function being compiled while my code is interpreted (doesn't sound likely). Any other ideas?

Solution

Its because of the Python implementation of .sort(). Python uses something called Timsort.

Timsort is a type of mergesort. Its distinguishing characteristic is that it identifies "runs" of presorted data that it uses for the merge. In real world data, sorted runs in unsorted data are very common and you can sort two sorted arrays in O(n) time if they are presorted. This can cut down tremendously on sort times which typically run in O(nlog(n)) time.

So what's happening is that when you call list.sort() in Python, it identifies the two runs of sorted data list1 and list2 and merges them in O(n) time. Additionally, this implementation is compiled C code which will be faster than an interpreted Python implementation doing the same thing.