Tags: c++, sorting, memory-management, external-sorting

Sort 1TB file on machine with 1GB RAM


This question seems easy, but I am not able to understand the real work behind it. I know people will say: break it down into 512 MB chunks and sort them, for example with merge sort using MapReduce.

So here is the actual question I have:

Suppose I break the file into 512 MB chunks and then send them to different host machines to sort, and suppose those machines use merge sort. Now say I had 2000 machines, each sorting one of the 2000 512 MB chunks. When I merge them back, how does that work? Won't the size keep increasing again? For example, merging two 512 MB chunks gives 1024 MB, which is the size of my RAM, so how would this work? No machine can merge a chunk of more than 512 MB with another chunk, because then the combined size would exceed 1 GB.

How, at the end of the merging, will I ever be able to merge one 0.5 TB chunk with another 0.5 TB chunk? Does the concept of virtual memory come into play here?

I am here to clarify my basics, and I hope I am asking this very important question correctly. Also, who should do this merge (after the sorting)? My machine, or a few of those 2000 machines?


Solution

  • Here's a theoretical way that should work. Say you've got your 2000 sorted 512 MB files, ready to create one 1 TB file.

    If you simply loop through every file, find which one has the lowest first value, move that value into your destination file, and repeat, you'll end up with everything in order. RAM usage should be tiny, as you never need to hold more than one line from each file in memory at a time.

    Obviously you can optimize this: keep the current line of every file in RAM as you go, and pull the smallest one from a min-heap instead of rescanning all 2000 files on every step, which makes each step logarithmic in the number of chunks. See the sketch below.
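
    Here is a minimal sketch of that heap-based k-way merge in C++. It is an illustration under some assumptions, not a definitive implementation: the chunk file names (chunk_0.txt, chunk_1.txt, ...) and the output name are hypothetical, and each chunk is assumed to hold one sorted integer per line. The key point is that RAM holds only one value per chunk at any moment.

        // k-way merge of pre-sorted chunk files using a min-heap.
        // Memory use is proportional to the number of chunks, not their size.
        #include <cstdint>
        #include <fstream>
        #include <functional>
        #include <queue>
        #include <string>
        #include <utility>
        #include <vector>

        int main() {
            const int kNumChunks = 2000;  // number of pre-sorted chunk files

            // Open every chunk (file names are hypothetical placeholders).
            std::vector<std::ifstream> chunks;
            chunks.reserve(kNumChunks);
            for (int i = 0; i < kNumChunks; ++i)
                chunks.emplace_back("chunk_" + std::to_string(i) + ".txt");

            // Min-heap of (current value, index of the chunk it came from);
            // this is the "keep the first line of every file in RAM" idea.
            using Entry = std::pair<std::int64_t, int>;
            std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;

            // Seed the heap with the first value of every chunk.
            for (int i = 0; i < kNumChunks; ++i) {
                std::int64_t v;
                if (chunks[i] >> v) heap.push({v, i});
            }

            // Repeatedly emit the globally smallest value, then refill the
            // heap from the chunk that value came from.
            std::ofstream out("sorted_1tb.txt");
            while (!heap.empty()) {
                auto [v, i] = heap.top();
                heap.pop();
                out << v << '\n';
                std::int64_t next;
                if (chunks[i] >> next) heap.push({next, i});
            }
            return 0;
        }

    This also answers the "who merges?" worry: because the merge only ever holds one value per chunk, a single 1 GB machine can stream the final 1 TB output straight to disk. One practical caveat: opening 2000 files at once can exceed the OS's default file-descriptor limit, so real external-sort implementations often merge in multiple passes (say, 50 files at a time) and add buffered reads and writes for throughput.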