
Finding the k largest elements of a very large file (when k is also very large)


Let's assume that we have a very large file which contains billions of integers, and we want to find the k largest of these values.

The tricky part is that k itself is very large too, which means we cannot keep k elements in memory (for example, we have a file with 100 billion elements and we want to find the 10 billion largest elements).

How can we do this in O(n)?

What I thought:

We start reading the file and compare each element against a second file that keeps the k largest elements seen so far (sorted in increasing order). If the element just read is larger than the first line of the second file, we delete that first line and insert the new element into the second file. The time complexity would be O(N log K) if we have random access to that file; otherwise it would be O(NK).
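
For reference, the in-memory analogue of this idea is a min-heap of size k, which gives the O(N log K) bound. The sketch below is only a baseline under the assumption that k elements fit in memory, which is exactly what does not hold in this question; the function name `k_largest_in_memory` is made up for illustration.

```python
import heapq


def k_largest_in_memory(values, k):
    """Return the k largest items of an iterable, largest first.

    Assumes k items fit in memory (NOT true for the problem in the question).
    """
    if k <= 0:
        return []
    heap = []  # min-heap holding the k largest values seen so far
    for x in values:
        if len(heap) < k:
            heapq.heappush(heap, x)
        elif x > heap[0]:
            # x beats the smallest of the current top k: swap it in, O(log k)
            heapq.heapreplace(heap, x)
    return sorted(heap, reverse=True)
```

The smallest kept element sits at `heap[0]`, playing the role of "the first line of the second file" above.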

Any ideas for doing this in O(n)? I guess that with an external version of the selection algorithm (the partitioning algorithm in quicksort) we would be able to do this in O(n), but I couldn't find one anywhere.


Solution

  • PS: My definition of K is different. It is a smallish number, say 2 or 100 or 1000. Here m corresponds to the OP's definition of k. Sorry about this.

    It depends on how many passes you can make over the original data and how much extra space you have. This approach assumes you have extra space equivalent to the original data.

    Step 1: Pick K random numbers from across the whole data set.
    Step 2: Sort the K numbers (assume the indices run from 1 to K).
    Step 3: Create K+1 separate files and name them 0 to K.
    Step 4: For every element in the data, if it lies between the ith and (i+1)th sorted numbers, put it in the ith file (file 0 takes everything below the 1st number, file K everything above the Kth).
    Step 5: Based on the size of each file, choose the file that contains the mth number (counting from the smallest).
    Step 6: Repeat everything with the new file and the new m (new_m = m - sum_of_size_of_all_lower_files).

    Regarding the last step: if K=2, m=1000, and the sizes of files 0, 1, and 2 are 800, 900, and 200, then new_m = m - 800 = 200, and you continue iteratively with file 1.
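
The steps above can be sketched in Python roughly as follows. This is a minimal sketch, not a tuned implementation, and it rests on several assumptions that are not in the answer: the input is a text file with one integer per line, pivots are picked by reservoir sampling, elements equal to a pivot are only counted instead of being written to a bucket (so that every round is guaranteed to shrink the data even with many duplicates), and the work finishes in memory once a file has at most `mem_limit` elements. The names `select_mth_smallest`, `num_pivots`, and `mem_limit` are made up for illustration.

```python
import bisect
import os
import random
import tempfile


def select_mth_smallest(path, m, num_pivots=4, mem_limit=1000, rng=None):
    """Return the m-th smallest integer (1-based) of a one-integer-per-line file."""
    rng = rng or random.Random(0)
    # Bucket files of all rounds live here and are deleted on exit
    # (a real implementation would delete each round's files as it goes).
    with tempfile.TemporaryDirectory() as workdir:
        rnd = 0
        while True:
            rnd += 1
            # Step 1: one pass that counts elements and reservoir-samples pivots.
            sample, n = [], 0
            with open(path) as src:
                for line in src:
                    n += 1
                    if len(sample) < num_pivots:
                        sample.append(int(line))
                    else:
                        j = rng.randrange(n)
                        if j < num_pivots:
                            sample[j] = int(line)
            if not 1 <= m <= n:
                raise ValueError("m must be between 1 and the number of elements")
            if n <= mem_limit:
                # Assumption: small enough to finish with an in-memory sort.
                with open(path) as src:
                    return sorted(int(line) for line in src)[m - 1]
            # Step 2: sort the pivots (duplicate pivots are dropped).
            pivots = sorted(set(sample))
            # Step 3: bucket i holds values strictly between pivot i-1 and pivot i.
            names = [os.path.join(workdir, f"round{rnd}_bucket{i}.txt")
                     for i in range(len(pivots) + 1)]
            sizes = [0] * len(names)
            equal = [0] * len(pivots)  # counts of values equal to each pivot
            outs = [open(name, "w") for name in names]
            try:
                # Step 4: second pass, route every element to its bucket.
                with open(path) as src:
                    for line in src:
                        x = int(line)
                        i = bisect.bisect_left(pivots, x)
                        if i < len(pivots) and pivots[i] == x:
                            equal[i] += 1
                        else:
                            outs[i].write(f"{x}\n")
                            sizes[i] += 1
            finally:
                for out in outs:
                    out.close()
            # Steps 5-6: walk the buckets in increasing order, lowering m by the
            # size of everything skipped. Because m <= n, the loop always breaks
            # or returns before indexing past the last pivot.
            for i in range(len(names)):
                if m <= sizes[i]:
                    path = names[i]  # recurse into this bucket with the new m
                    break
                m -= sizes[i]
                if m <= equal[i]:
                    return pivots[i]
                m -= equal[i]
```

To get the OP's k largest elements out of n, one option is to call this with m = n - k + 1 to obtain a threshold value, then make one more pass over the original file that writes out everything above the threshold, plus as many copies of the threshold itself as are needed to reach k. Each round makes two passes over the current file, and with random pivots the files shrink quickly in expectation, which is what keeps the total work close to linear.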