algorithm performance sorting optimization language-agnostic

Sorting a small array into a large sorted array

What is the best algorithm for merging a large sorted array with a small unsorted array?

I'll give examples of what I mean from my particular use case, but don't feel bound by them: I'm mostly trying to give a feel for the problem.

8 MB sorted array with 92 kB unsorted array (in-cache sort)
2.5 GB sorted array with 3.9 MB unsorted array (in-memory sort)
34 GB sorted array with 21 MB unsorted array (out-of-memory sort)

Solution

You can implement a chunk-based algorithm to solve this problem efficiently (whatever the input size of the arrays as long as one is much smaller than the other).

First of all, you need to sort the small array (possibly using a radix sort or a bitonic sort if you do not need a custom comparator). Then the idea is to cut the big array in chunks fully fitting in the CPU cache (eg. 256 KiB). For each chunk, find the index of the last item in the small array <= to the last item of the chunk using a binary search. This is relatively fast because the small array likely fit in the cache and the same items of the binary search are fetched between consecutive chunks if the array is big. This index enable you to know how many items need to be merged with the chunks before being written. For each value to be merged in the chunk, find the index of the value using a binary search in the chunk. This is fast because the chunk fit in the cache. Once you know the index of the values to be inserted in the chunk, you can efficiently move the item by block in each chunk (possibly in-place from the end to the beginning). This implementation is much faster than the traditional merge algorithm since the number of comparison needed is much smaller thanks to the binary search and small number of items to be inserted by chunk.

For relatively big input, you can use a parallel implementation. The idea is to work on a group of multiple chunks at the same time (ie. super-chunks). Super-chunks are much bigger than classical ones (eg. >=2 MiB). Each thread work on a super-chunk at a time. A binary search is performed on the small array to know how many values are inserted in each super-chunk. This number is shared between threads so that each threads know where it can safely write the output independently of other thread (one could use a parallel-scan algorithm to do that on massively parallel architecture). Each super-chunk is then split in classical chunks and the previous algorithm is used to solve the problem in each thread independently. This method should be more efficient even in sequential when the small input arrays do not fit in the cache since the number of binary search operations in the whole small array will be significantly reduced.

The (amortized) time complexity of the algorithm is O(n (1 + log(m) / c) + m (1 + log(c))) with m the length of the big array, n the length of the small array and c the chunk size (super-chunks are ignored here for sake of clarity, but they only change the complexity by a constant factor like the constant c does).

Alternative method / Optimization: If your comparison operator is cheap and can be vectorized using SIMD instructions, then you can optimize the traditional merge algorithm. The traditional method is quite slow because of branches (that can hardly be predicted in the general case) and also because it cannot be easily/efficiently vectorized. However, because the big array is much bigger than the small array, the traditional algorithm will pick a lot of consecutive value from the big array in between the ones of the small array. This means that you can pick SIMD chunks of the big array and compare the values with one of the small array. If all SIMD items are smaller than the one picked from the small array, then you can write the whole SIMD chunk at once very efficiently. Otherwise, you need to write a part of the SIMD chunk, then write the item of the small array and switch to the next one. This last operation is clearly less efficient but it should happen rarely since the small array is much smaller than the big one. Note that the small array still needs to be sorted first.