Tags: c++, algorithm, complexity-theory, asymptotic-complexity

Top K smallest selection algorithm: O(n + k log n) vs O(n log k) for k << n


I'm asking this with regard to the Top K (k smallest) selection problem. I would expect O(n + k log n) to be faster: for instance, plugging in k = 300 and n = 100000000, O(n + k log n) comes out smaller.
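Concretely, taking base-2 logarithms for illustration (my numbers, not part of the original question): n + k log n ≈ 10^8 + 300 · 26.6 ≈ 1.00008 · 10^8, while n log k ≈ 10^8 · 8.23 ≈ 8.23 · 10^8, i.e. roughly 8x larger.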

However, when I benchmark this in C++, the O(n log k) version comes out more than 2x faster. Here's the complete benchmarking program:

#include <iostream>
#include <vector>
#include <algorithm>
#include <iterator>
#include <ctime>
#include <cstdlib>
using namespace std;

int RandomNumber () { return rand(); }
// O(n + k log n): build a min-heap over all n elements, then pop the minimum k times.
vector<int> find_topk(int arr[], int k, int n)
{
   make_heap(arr, arr + n, greater<int>());        // min-heap over the whole array: O(n)

   vector<int> result(k);

   for (int i = 0; i < k; ++i)
   {
      result[i] = arr[0];                          // current minimum
      pop_heap(arr, arr + n - i, greater<int>());  // remove it: O(log n)
   }

   return result;
}

// O(n log k): keep a max-heap of the k smallest elements seen so far, in one pass.
vector<int> find_topk2(int arr[], int k, int n)
{
   make_heap(arr, arr + k, less<int>());           // max-heap over the first k elements: O(k)

   for (int i = k; i < n; ++i)
   {
      if (arr[i] < arr[0])                         // smaller than the largest of the kept k?
      {
         pop_heap(arr, arr + k, less<int>());      // drop the largest kept element
         arr[k - 1] = arr[i];
         push_heap(arr, arr + k, less<int>());     // insert the new one: O(log k)
      }
   }

   vector<int> result(arr, arr + k);

   return result;
}


int main()
{
   const int n = 220000000;
   const int k = 300;

   srand(time(0));
   int* arr = new int[n];

   generate(arr, arr + n, RandomNumber);

   // replace with find_topk or find_topk2
   vector<int> result = find_topk2(arr, k, n);

   copy(result.begin(), result.end(), ostream_iterator<int>(cout, "\n"));

   delete[] arr;

   return 0;
}
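
For reference (this harness is my addition, not part of the original post), one way to time the two functions in the same process, assuming C++11 <chrono> and the definitions above, is something like:

#include <chrono>

// Times a callable once and returns milliseconds. Each find_topk* call mutates
// the array, so run each candidate on its own copy of the data.
template <typename F>
double time_ms(F f)
{
   auto start = std::chrono::steady_clock::now();
   f();
   auto stop = std::chrono::steady_clock::now();
   return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Usage sketch inside main(), after generate():
//   std::vector<int> copy1(arr, arr + n), copy2(arr, arr + n);
//   cout << time_ms([&]{ find_topk(copy1.data(), k, n); })  << " ms\n";
//   cout << time_ms([&]{ find_topk2(copy2.data(), k, n); }) << " ms\n";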

find_topk's approach is to build a heap of all n elements in O(n) and then remove the top of the heap k times, each removal costing O(log n). find_topk2's approach is to build a max-heap of the first k elements in O(k), and then, for each of the remaining n - k elements, check whether it is smaller than the top; if so, pop the top and push the new element, which costs O(log k) per element, so O(n log k) overall. Both approaches are written quite similarly, so I don't believe any implementation detail (like creating temporaries etc.) causes a difference beyond the algorithm and the data set (which is random).

I actually profiled the benchmark and saw that find_topk calls the comparison operator many more times than find_topk2 (a way to count those comparisons directly is sketched after the two questions below). But I'm more interested in the reasoning about the theoretical complexity, so two questions:

  1. Disregarding the implementation and the benchmark, was I wrong to expect O(n + k log n) to beat O(n log k)? If I'm wrong, please explain why and how to reason about it so that I can see that O(n log k) is actually better.
  2. If I'm not wrong in question 1, why does my benchmark show otherwise?
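
For completeness (this is my addition, not part of the original question), the comparison counts can be measured with comparators that count their own calls, substituted for less<int>()/greater<int>() in the two functions:

struct counting_less
{
   static size_t calls;                     // total comparisons performed
   bool operator()(int a, int b) const { ++calls; return a < b; }
};
size_t counting_less::calls = 0;

struct counting_greater
{
   static size_t calls;
   bool operator()(int a, int b) const { ++calls; return a > b; }
};
size_t counting_greater::calls = 0;

// Use counting_greater() in find_topk and counting_less() in find_topk2,
// reset .calls to 0 before each run, and print the value afterwards.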

Solution

  • Big O in several variables is tricky, since you need assumptions about how your variables scale with one another before you can take the limit to infinity unambiguously.

    If, e.g., k ~ n^(1/2), then O(n log k) becomes O(n log n), while O(n + k log n) becomes O(n + n^(1/2) log n) = O(n), which is better.

    If k ~ log n, then O(n log k) = O(n log log n) and O(n + k log n) = O(n), which is better. Note that log log 2^1024 = 10, so the constants hidden in the O(n) may be greater than log log n for any realistic n.

    If k = constant, then O(n log k) = O(n) and O(n + k log n) = O(n), which is the same.

    But the constants play a big role: for instance, building a heap over the whole array may involve reading the array about 3 times, whereas building a priority queue of length k as you go only requires one pass through the array, plus a small constant times log k per update (a one-pass sketch using std::priority_queue is given at the end of this answer).

    Which is "better" is therefore unclear, although my quick analysis tended to show that O(n + k log n) performs better under mild assumptions on k.

    For instance, if k is a very small constant (say k = 3), then I'm ready to bet that the make_heap approach performs worse than the priority queue one on real world data.

    Use asymptotic analysis wisely, and above all, profile your code before drawing conclusions.
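
As a concrete illustration of the one-pass "priority queue of length k" approach mentioned above (my sketch, not the original answerer's code), using std::priority_queue:

#include <queue>
#include <vector>

// Keep the k smallest elements seen so far in a max-heap of size k; the top of
// the heap is the current k-th smallest candidate.
std::vector<int> topk_smallest(const int* arr, int k, int n)
{
   std::priority_queue<int> heap;            // max-heap by default
   for (int i = 0; i < n; ++i)               // single pass over the input
   {
      if ((int)heap.size() < k)
         heap.push(arr[i]);                  // still filling the first k slots
      else if (arr[i] < heap.top())          // smaller than the largest kept element
      {
         heap.pop();                         // drop that largest element
         heap.push(arr[i]);                  // O(log k) per replacement
      }
   }
   std::vector<int> result;
   result.reserve(heap.size());
   while (!heap.empty()) { result.push_back(heap.top()); heap.pop(); }
   return result;                            // the k smallest, in descending order
}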