Search code examples
algorithmmergesortexternal-sorting

multi-way merge vs 2-way merge


When we externally merge sort a large file, we split it into small ones, sort those, and then merge them back into a large sorted file.

When merging, we can either do many 2-way merge passes, or one multi-way merge.

I am wondering which approach is better? and why?


Solution

  • One multi-way merge is generally better. Consider three small files:

    a1
    a2
    a3
    

    and

    b1
    b2
    b3
    

    and finally

    c1
    c2
    c3
    

    If you do a merge with a and b, we're left with (say)

    a1
    b1
    a2
    b2
    b3
    a3
    

    and

    c1
    c2
    c3
    

    A final merge would create the sorted list, but notice how in this final merge we have to visit the a and b items again. It's this re-merging that is wasteful in cascading two-way merges.

    What you can do instead is a single multi-way merge. However, be careful how you do it. Specifically, avoid the naive double-loop that scans each cursor to see which has the minimum value. Use a min-heap instead. This will bring the complexity back down to O(n log n).