Let's say there is a large file with N elements, and I want to find the top K largest elements in that file and write them to a new file.
If N is large but K is small, I could use a min-heap of size K while doing a buffered read of the input file.
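For reference, a minimal sketch of that min-heap pass in Python (the one-number-per-line format, file handling, and function name are just assumptions for illustration):

```python
import heapq

def top_k_small_k(path_in, path_out, k):
    """Single pass keeping a min-heap of the k largest values seen so far."""
    heap = []                      # heap[0] is the smallest of the current top k
    with open(path_in) as f:
        for line in f:
            x = float(line)
            if len(heap) < k:
                heapq.heappush(heap, x)
            elif x > heap[0]:
                heapq.heapreplace(heap, x)   # drop the old minimum, keep x
    with open(path_out, "w") as out:
        for x in sorted(heap, reverse=True):
            out.write(f"{x}\n")
```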
However, if K is also a huge number and we cannot hold K elements in memory, then the min-heap approach would not work.
In such scenarios, how can we get the desired output? I am wondering if we can do something similar to external merge sort, where we keep intermediate files on disk and merge them iteratively, but I can't think of an exact solution.
For some L that comfortably fits in memory, use Reservoir Sampling to pick a random set of L elements in a single pass.
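A sketch of that sampling pass, using the classic Algorithm R (again assuming one number per line; the function name is mine):

```python
import random

def reservoir_sample(path, L):
    """One pass over the file, returning a uniform random sample of L values."""
    sample = []
    with open(path) as f:
        for i, line in enumerate(f):
            x = float(line)
            if i < L:
                sample.append(x)             # fill the reservoir first
            else:
                j = random.randrange(i + 1)  # uniform index in [0, i]
                if j < L:
                    sample[j] = x            # replace with probability L/(i+1)
    return sample
```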
How many of the L sampled elements do we expect to find in the top K? Let p = K/N. Each sampled element is in the top K with probability p, so its indicator is a random variable with mean p and variance p*(1-p). Add together L of them and the count is approximately normal with mean p*L and variance p*(1-p)*L, and therefore standard deviation sqrt(p*(1-p)*L). (If p*L < 20, you're better off using a Poisson approximation instead of a normal, but the same principle applies.)
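To make the statistics concrete, here is a sketch (function and parameter names are mine) that reads pivot values off the sorted sample at chosen standard-deviation offsets from the expected rank of the top-K cutoff; the partitioning step described next uses offsets from -5 through +5:

```python
import math

def partition_points(sample, K, N, sigmas=range(-5, 6)):
    """Pick pivot values from the sample around the expected rank of the top-K cutoff."""
    L = len(sample)
    p = K / N
    mean = p * L                        # expected sample rank of the cutoff
    sd = math.sqrt(p * (1 - p) * L)     # standard deviation of that rank
    ordered = sorted(sample, reverse=True)
    pivots = []
    for m in sigmas:
        idx = min(max(int(round(mean + m * sd)), 0), L - 1)  # clamp to the sample
        pivots.append(ordered[idx])
    return sorted(set(pivots))          # ascending, duplicates removed
```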
So we can sort the sample and use it to locate the bounds of a set of confidence intervals. Let's pick 11 points in the sample, at the positions that are -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5 standard deviations away from where we expect the top-K cutoff to be. Then we can do a second pass and split the file into 12 smaller files on those boundaries, using a decision tree that tests the two end points first, then the middle, then ±2, and only then narrows down to the exact range if needed. Keep a count of how many elements land in each file.
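And a sketch of that second pass, assuming the pivots come from the helper above, sorted ascending; the explicit end tests mirror the decision-tree idea of settling the bulk of the file in at most two comparisons (the bucket file naming is arbitrary):

```python
import bisect

def partition_pass(path_in, pivots, out_prefix="part"):
    """Route every value into one of len(pivots)+1 bucket files and count them.

    pivots must be sorted ascending; bucket 0 holds everything <= pivots[0],
    the last bucket holds everything > pivots[-1].
    """
    n_buckets = len(pivots) + 1
    files = [open(f"{out_prefix}_{i}.txt", "w") for i in range(n_buckets)]
    counts = [0] * n_buckets
    with open(path_in) as f:
        for line in f:
            x = float(line)
            if x <= pivots[0]:                      # below the band: not in the top K
                b = 0
            elif x > pivots[-1]:                    # above the band: in the top K
                b = n_buckets - 1
            else:
                b = bisect.bisect_left(pivots, x)   # binary search inside the band
            files[b].write(line)
            counts[b] += 1
    for fh in files:
        fh.close()
    return counts
```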
The result is that in 2 passes, with odds of under 1 in a million of failing, we've narrowed the top-K cutoff down to a single file with roughly N/sqrt(L) elements: the files above it are entirely in the top K, the files below it are entirely out, and only the file containing the cutoff is still in question. Repeat on that file until it is small enough to finish in memory.
Even when we do fail, we've at least partitioned it down to a smaller file, and can repeat the second pass using new partition points based on what we've learned.
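Putting it together, a rough outer loop might look like the following; it reuses the sketches above, ignores edge cases like heavy duplication, and in practice you would switch to the in-memory min-heap as soon as the straddling bucket fits in memory:

```python
def select_top_k(path_in, path_out, K, N, L=1_000_000):
    """Sample, pick pivots, partition, then recurse on the bucket that straddles rank K."""
    remaining_k, current, current_n = K, path_in, N
    with open(path_out, "w") as out:
        while remaining_k > 0:
            sample = reservoir_sample(current, min(L, current_n))
            pivots = partition_points(sample, remaining_k, current_n)
            counts = partition_pass(current, pivots, out_prefix=current + ".part")
            # Walk buckets from the largest values down, emitting whole buckets while they fit.
            for b in range(len(counts) - 1, -1, -1):
                bucket = f"{current}.part_{b}.txt"
                if counts[b] <= remaining_k:
                    with open(bucket) as fh:
                        out.writelines(fh)       # every element here is in the top K
                    remaining_k -= counts[b]
                else:
                    current, current_n = bucket, counts[b]   # the cutoff lies in this bucket
                    break
            else:
                break   # all buckets emitted; remaining_k is now 0
```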
If L is a million, the standard deviation is at most sqrt(L)/2 = 500, so the 11 partition points span at most about 5,000 of the million sampled elements, i.e. roughly 0.5% of the file. In other words, with odds better than 99.9999%, over 99% of the file winds up decisively in or out of our top K with only 2 reads, one write, and 2 comparisons per element. The rest of the work is a rounding error compared to that.
Note that if the file is on a distributed file system, this calculation can be distributed and parallelized. The overall work remains similar, but the number of passes will increase. However, with 2 passes reducing the size of the problem by a factor of 1000, the number of passes is O(log(N)) with a good constant.