Search code examples
heappriority-queue

For finding top K elements using heap, which approach is better - NlogK or KLogN?


For finding top K elements using heap, which approach is better?

  1. NlogK, use Minheap of size K and remove the minimum element so top k elements remain in heap
  2. KlogN, use Maxheap, store all elements and then extract top K elements

I did some calculations and at no point, I see that NLogK better than KlogN.

N= 16 (2^4), k = 8 (2^3) 
O(Nlog(K)) = 16* 3 = 48
O(Klog(N)) = 8 * 4 = 32

N= 16 (2^4), k = 12  (log to base 2 = 3.5849)
O(Nlog(K)) = 16* 3.5849 = 57.3584
O(Klog(N)) = 12 * 4 = 48

N= 256 (2^8), k = 4 (2^2)
O(Nlog(K)) = 256* 2 = 512
O(Klog(N)) = 4 * 8 = 32

N= 1048576 (2^20), k = 16 (2^4)
O(Nlog(K)) = 1048576* 4 = 4194304
O(Klog(N)) = 16 * 20 = 320

N= 1048576 (2^20), k = 1024 (2^10)
O(Nlog(K)) = 1048576* 10 = 10485760
O(Klog(N)) = 1024 * 20 = 20480

N= 1048576 (2^20), k = 524288 (2^19)
O(Nlog(K)) = 1048576* 19 = 19922944
O(Klog(N)) = 524288 * 20 = 10485760

I just wanted to confirm that my approach is correct and adding all elements to heap and extract top k elements is always the best approach. (and also simpler one)


Solution

  • I did some calculations and at no point, I see that NLogK better than KlogN.

    Since, K <= N, NlogK will always be more than or equal to KlogN.

    That does not mean, min heap approach is going to take more time than max heap approch.

    Need to consider the below,

    1. In Min heap approach, we will update the heap only if the next value is larger than the head. If the array is in ascending order, we will do it (N-K) times, if it is descending order we will not update it at all. On average, the number of times the tree gets updated is considerable less than N.

    2. In Max heap, you need to heapify the tree of size N. If K is negligibly small when compared to N, then this time can become a dominant factor. While in the case of min heap, heapify works on the smaller set of K. Also as mentioned in point 1, most of the value of N will not trigger an update of the tree.

    I wrote a small program to compare both the approach. The source can be accessed here.

    Results for an array ranging from 0 to 1M in random order is below:

    Test for Array size: 1000000
    
    k         MinHeapIterations   MinHeapTime(ms)     MaxHeapTime(ms)     MaxTime/MinTime     
    1         15                  6.07                72.03               11.88               
    10        114                 3.85                70.09               18.19               
    100       913                 4.11                69.60               16.93               
    1000      6874                5.32                72.94               13.71               
    10000     46123               16.52               79.89               4.83                
    100000    230385              78.19               132.27              1.69                
    1000000   0                   35.86               453.57              12.65
    

    As you can see,

    1. Min heap does outperforms Max heap for all values of K
    2. The no of tree updates incase of Min heap (MinHeapIterations) is very much less than (N-K)