I'm asking this in regard to the top-k algorithm. I'd think that O(n + k log n) should be faster, because, for instance, if you plug in k = 300 and n = 100000000, you can see that O(n + k log n) is smaller.
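To spell out that arithmetic (using base-2 logs): n log k ≈ 10^8 * 8.2 ≈ 8.2 * 10^8, whereas n + k log n ≈ 10^8 + 300 * 26.6 ≈ 1.00008 * 10^8, so the second expression is roughly 8x smaller.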
However, when I benchmark it in C++, O(n log k) turns out to be more than 2x faster. Here's the complete benchmarking program:
#include <iostream>
#include <vector>
#include <algorithm>
#include <iterator>
#include <ctime>
#include <cstdlib>

using namespace std;

int RandomNumber() { return rand(); }

// O(n + k log n): heapify the whole array, then pop the top k times.
vector<int> find_topk(int arr[], int k, int n)
{
    make_heap(arr, arr + n, greater<int>());
    vector<int> result(k);
    for (int i = 0; i < k; ++i)
    {
        result[i] = arr[0];
        pop_heap(arr, arr + n - i, greater<int>());
    }
    return result;
}

// O(n log k): keep a max-heap of the k smallest elements seen so far.
vector<int> find_topk2(int arr[], int k, int n)
{
    make_heap(arr, arr + k, less<int>());
    for (int i = k; i < n; ++i)
    {
        if (arr[i] < arr[0])
        {
            pop_heap(arr, arr + k, less<int>());
            arr[k - 1] = arr[i];
            push_heap(arr, arr + k, less<int>());
        }
    }
    vector<int> result(arr, arr + k);
    return result;
}

int main()
{
    const int n = 220000000;
    const int k = 300;

    srand(time(0));

    int* arr = new int[n];
    generate(arr, arr + n, RandomNumber);

    // replace with find_topk or find_topk2
    vector<int> result = find_topk2(arr, k, n);
    copy(result.begin(), result.end(), ostream_iterator<int>(cout, "\n"));
    return 0;
}
find_topk's approach is to build a heap over all n elements in O(n), and then remove the top of the heap k times, each pop costing O(log n). find_topk2's approach is to build a heap of size k in O(k), with the max element at the top, and then walk from k to n, comparing each element against the top; whenever an element is smaller, it pops the top and pushes the new element, which is at most n operations of O(log k) each. Both approaches are written quite similarly, so I don't believe any implementation detail (like creating temporaries, etc.) can cause a difference besides the algorithm and the dataset (which is random).
I actually profiled the benchmark and could see that find_topk calls the comparison operator many more times than find_topk2. But I'm more interested in the reasoning behind the theoretical complexity.
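For reference, the comparisons can also be counted directly with a comparator that increments a counter; this is only a sketch, not exactly how I profiled it:

    // Sketch: a drop-in comparator that counts how many times it is invoked.
    struct counting_greater
    {
        static long long calls;     // shared across the copies the heap algorithms make
        bool operator()(int a, int b) const
        {
            ++calls;
            return a > b;
        }
    };
    long long counting_greater::calls = 0;

    // e.g. in find_topk, replace greater<int>() with counting_greater() in the
    // make_heap and pop_heap calls, then print counting_greater::calls afterwards
    // (an analogous counting_less would do the same job in find_topk2).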
Big O in several variables is tricky, since you need assumptions about how your variables scale with one another before you can take the limit to infinity unambiguously.
If, e.g., k ~ n^(1/2), then O(n log k) becomes O(n log n), while O(n + k log n) becomes O(n + n^(1/2) log n) = O(n), which is better.
If k ~ log n, then O(n log k) = O(n log log n), while O(n + k log n) = O(n + (log n)^2) = O(n), which is again better. Note that log log 2^1024 = 10 (base 2), so the constants hidden in the O(n) may well be greater than log log n for any realistic n.
If k = constant, then O(n log k) = O(n) and O(n + k log n) = O(n), which is the same.
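To put rough numbers on the first case (my own figures, just to illustrate the scaling): with n = 10^8 and k = 10^4 ~ n^(1/2), n log k ≈ 10^8 * 13.3 ≈ 1.3 * 10^9, while n + k log n ≈ 10^8 + 10^4 * 26.6 ≈ 1.003 * 10^8, a factor of roughly 13 in raw operation counts before any constants enter.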
But the constants play a big role: for instance, building a heap may involve reading the array 3 times, whereas building a priority queue of length k as you go only requires one pass through the array, and a small constant times log k for the lookup.
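To make that one-pass version concrete, here is a sketch using std::priority_queue, with the same "k smallest" convention as find_topk2 above (find_topk_pq is just a name I'm using for illustration; it is not from the question's benchmark):

    #include <queue>
    #include <vector>

    // Sketch: same one-pass idea as find_topk2, expressed with std::priority_queue.
    // Keeps a max-heap of the k smallest elements seen so far; the input array is
    // left untouched (unlike find_topk2, which overwrites its first k slots).
    std::vector<int> find_topk_pq(const int arr[], int k, int n)
    {
        std::priority_queue<int> pq(arr, arr + k);   // max-heap of the first k elements
        for (int i = k; i < n; ++i)
        {
            if (arr[i] < pq.top())                   // smaller than the current worst of the k
            {
                pq.pop();
                pq.push(arr[i]);
            }
        }
        std::vector<int> result;
        result.reserve(k);
        while (!pq.empty())                          // emitted largest-first
        {
            result.push_back(pq.top());
            pq.pop();
        }
        return result;
    }

Since std::priority_queue's push and pop are specified in terms of push_heap and pop_heap on the underlying container, its constant factors should be essentially those of find_topk2, plus the vector bookkeeping.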
Which is "better" is therefore unclear, although my quick analysis tended to show that O(n + k log n) performs better under mild assumptions on k.
For instance, if k is a very small constant (say k = 3), then I'm ready to bet that the make_heap approach performs worse than the priority queue one on real-world data.
Use asymptotic analysis wisely, and above all, profile your code before drawing conclusions.