Tags: algorithm, sorting, out-of-memory, mergesort

How to sort multiple GB of data in different files?


I have 50 text files containing around 10 GB of numbers, and I have to sort these numbers. My first idea was to apply merge sort, i.e. sort each file separately and then merge them. I am using an array to load the numbers, and when I run the application my program crashes due to lack of memory. So my questions are:

  1. Which data structure to use?
  2. How to manage the memory?
  3. Is merge sort the correct approach? If not, please suggest an alternative.

Any help will be appreciated.


Solution

  • If the numbers are integers with at most 7 digits, then you can use a counting sort.

    You will only need around 40 MB of memory to store 10 million 4-byte integers, each holding the count of how many times the corresponding number from 0 to 9,999,999 occurs. If you have to deal with the possibility of more than 2.14 billion duplicates of the same number, use 64-bit integers instead. Initialize the array to zero, then read the numbers in one at a time, updating the count for each. Once you are finished, you can generate the sorted output from the counts.
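A minimal sketch of this two-pass counting sort in Python, assuming one non-negative integer per line and an upper bound of 9,999,999 (the file names and function name are illustrative, not from the question):

```python
from array import array

MAX_VALUE = 9_999_999  # assumed largest value: 7-digit non-negative integers


def counting_sort_files(input_paths, output_path):
    # One counter per possible value. 'q' is a signed 64-bit element,
    # which is safe even if one number repeats more than 2.14 billion
    # times. (Python's array uses ~80 MB here; the 40 MB figure in the
    # answer assumes 32-bit counters.)
    counts = array("q", [0]) * (MAX_VALUE + 1)

    # Pass 1: stream each file line by line and tally the counts.
    # The full data set is never held in memory.
    for path in input_paths:
        with open(path) as f:
            for line in f:
                counts[int(line)] += 1

    # Pass 2: emit each value as many times as it was counted.
    with open(output_path, "w") as out:
        for value, count in enumerate(counts):
            for _ in range(count):
                out.write(f"{value}\n")
```

The key point is that memory use depends only on the range of possible values, not on the amount of input, so sorting 50 files is no harder than sorting one.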