Tags: algorithm, data-structures, mapreduce

How to find top K when K is too large for memory


Let's say there is a large file with N elements and I want to find the top K largest elements from that file and output them to a new file.

If N is large but K is small, I could use a min-heap of size K while doing a buffered read of the input file.
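Something like this minimal sketch is what I have in mind for that case (assuming one integer per line; the function name and file layout are just illustrative):

```python
import heapq

def top_k_in_memory(path, k):
    """Keep the K largest values seen so far in a size-K min-heap.

    The heap root is always the smallest of the current top K, so any
    new value larger than the root replaces it.
    """
    heap = []                               # min-heap holding at most k values
    with open(path) as f:                   # buffered, line-by-line read
        for line in f:
            x = int(line)
            if len(heap) < k:
                heapq.heappush(heap, x)
            elif x > heap[0]:
                heapq.heapreplace(heap, x)  # pop the smallest, push x
    return sorted(heap, reverse=True)       # the K largest, biggest first
```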

However, if K is also a huge number and we cannot hold K elements in memory, then the min-heap approach would not work.

In such scenarios, how can we get the desired output? I am wondering if we can do something similar to external merge sort, where we have intermediate files on disk that we merge iteratively. However, I can't think of an exact solution.


Solution

  • For some L that comfortably fits in memory, use Reservoir Sampling to pick a random set of L elements in a single pass (a sketch of this sampling step appears at the end of this answer).

    How many of the L elements do we expect to find in the top K? Well, let p = K/N. Each sampled element lands in the top K with probability p, so membership is a Bernoulli random variable with mean p and variance p*(1-p). Add together L of them and the count of sampled elements in the top K is approximately normal with mean p*L and variance p*(1-p)*L, and therefore standard deviation sqrt(p*(1-p)*L). (If p*L < 20, you're better off going to a Poisson distribution instead of a normal, but the same principle will apply.)

    So we can sort the sample and read off the bounds of a set of nested confidence intervals for the K-th largest element. Let's pick 11 points in the sample, at the ranks that are -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5 standard deviations away from the expected rank p*L of the K-th largest element. Then we can do a second pass and split the file into 12 files on those boundaries, with a decision tree that tests the two ends first, then the middle, then the points at ±2, and only then finds the exact range if needed. Keep a count of the number of elements in each of those files. (The cut-point and partitioning sketches at the end of this answer show one way to do this.)

    The result is that in 2 passes, with odds of under 1/million of failing, we've narrowed the K-th largest element down to a file with roughly N/sqrt(L) elements. Repeat until that file is small enough to finish in memory (the driver sketch at the end ties these passes together).

    Even when we do fail, we've at least partitioned it down to a smaller file, and can repeat the second pass using new partition points based on what we've learned.

    If L is a million, then with odds of better than 99.9999%, over 99% of the file winds up either in or out of our top K at a cost of only 2 reads, 1 write, and 2 comparisons per element. The rest of the work is a rounding error compared to that.

    Note that if the file is on a distributed file system, this calculation can be distributed and parallelized. The overall work remains similar, but the number of passes will increase. However, with every 2 passes reducing the size of the problem by a factor of about sqrt(L) (1000 in this example), the number of passes is O(log(N)) with a good constant.
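Here are rough Python sketches of the steps above, assuming one integer per line of the file; all names are illustrative and nothing is tuned. First, the single-pass sample (plain Algorithm R reservoir sampling):

```python
import random

def reservoir_sample(path, L):
    """One pass over the file, returning a uniform random sample of L elements."""
    sample = []
    with open(path) as f:
        for i, line in enumerate(f):
            x = int(line)
            if i < L:
                sample.append(x)            # fill the reservoir first
            else:
                j = random.randrange(i + 1)
                if j < L:
                    sample[j] = x           # keep x with probability L/(i+1)
    return sample
```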
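Next, a sketch of reading the cut points off the sorted sample using the normal approximation above: the sample values at ranks p*L + j*sigma for j = -5..5. The clamping and tie handling are deliberately simplistic.

```python
import math

def cut_points(sample, N, K, sigmas=range(-5, 6)):
    """Sample values at ranks p*L + j*sigma (rank 0 = largest), p = K/N.

    With very high probability these thresholds bracket the K-th largest
    element of the full file.
    """
    L = len(sample)
    p = K / N
    sigma = math.sqrt(p * (1 - p) * L)
    ordered = sorted(sample, reverse=True)      # rank 0 = largest sample value
    cuts = []
    for j in sigmas:
        r = int(round(p * L + j * sigma))
        r = min(max(r, 0), L - 1)               # clamp to valid ranks
        cuts.append(ordered[r])
    return sorted(set(cuts), reverse=True)      # descending, duplicates dropped
```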
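Then the second pass: split the file into one bucket per gap between thresholds, counting elements as we go. Testing the two ends first is what settles the vast majority of elements in two comparisons; only the rare in-between values fall through to the search over the remaining cuts. The bucket file names and layout are assumptions.

```python
import os

def partition_pass(path, cuts, out_dir):
    """Split the file on the descending thresholds in `cuts`.

    Bucket 0 gets values above cuts[0] (almost certainly in the top K),
    bucket i gets values in (cuts[i], cuts[i-1]], and the last bucket gets
    values <= cuts[-1] (almost certainly out of the top K).
    Returns the element count of each bucket.
    """
    os.makedirs(out_dir, exist_ok=True)
    n = len(cuts) + 1
    files = [open(os.path.join(out_dir, f"bucket_{i}.txt"), "w") for i in range(n)]
    counts = [0] * n
    with open(path) as f:
        for line in f:
            x = int(line)
            if x > cuts[0]:                     # end test 1: clearly in
                b = 0
            elif x <= cuts[-1]:                 # end test 2: clearly out
                b = n - 1
            else:                               # rare: locate the exact bucket
                lo, hi = 1, n - 2
                while lo < hi:                  # smallest i with x > cuts[i]
                    mid = (lo + hi) // 2
                    if x > cuts[mid]:
                        hi = mid
                    else:
                        lo = mid + 1
                b = lo
            files[b].write(line)
            counts[b] += 1
    for fh in files:
        fh.close()
    return counts
```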
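Finally, a rough driver that ties the three sketches above together: it keeps narrowing the problem down to the single bucket that contains the K-th largest element, then finishes in memory. Once that value is known, one more pass over the original file (writing out everything larger, plus enough ties to reach exactly K) produces the requested output file. The constants and paths are placeholders.

```python
import os

def kth_largest(path, N, K, L=1_000_000, work_dir="buckets"):
    """Return the K-th largest element of the N-element file at `path`.

    Reuses reservoir_sample, cut_points, and partition_pass from the
    sketches above.
    """
    while N > L:                                # current file too big for RAM
        sample = reservoir_sample(path, L)
        cuts = cut_points(sample, N, K)
        counts = partition_pass(path, cuts, work_dir)
        seen = 0                                # elements in higher buckets
        for i, c in enumerate(counts):
            if seen + c >= K:                   # K-th largest lives in bucket i
                path = os.path.join(work_dir, f"bucket_{i}.txt")
                N, K = c, K - seen              # recurse into that bucket
                work_dir = os.path.join(work_dir, "next")
                break
            seen += c
    with open(path) as f:                       # small enough: finish exactly
        values = sorted((int(line) for line in f), reverse=True)
    return values[K - 1]
```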