Search code examples
mergesort

Ordering of files during the mergesort


I am interested in one topic, suppose that we have eight file, each containing 1 billion integers, and we should combine these files together into 8 billion integers files, all of them in each files are sorted. Of course, task is easy if we make 8 pass mergesort, but my question is, is it important ordering of files, in which order we should make combination on them? For example at the beginning, instead of combine first and second files, create new M file and combine with third file, maybe sometimes combination of second and third and then with first one would be more profitably? I think my question is clear. Does ordering of files during mergesort procedure matter? If so, how can we choose optimal one?


Solution

  • It's probably optimal to do an 8-way merge sort without intermediate files. Open 8 file handles, find the smallest integer from all 8, write that to the output file and read the next integer from that file. You could probably manage an 8-element array of your 8 sources (holding a file handle and the last value read), using an insertion sort.

    As far as ordering goes, if you could only merge two files at a time, I would probably merge the smallest files first. Simplify your example and you can see why.

    • Assume you have 3 files, with 1, 2 and 100 records in them.

    • If you merge the 1 & 2 into a temp file with 3 records, and then merge that with the 100, you'll have read 106 records and written 103.

    • If you instead merge the 1 & 100 into a temp file with 101 records, and then merge that with the 2, you'll have read 204 records and written 103.