Memory saving alternatives to CUDA Thrust sort_by_key?

In cuda/thrust: Trying to sort_by_key 2.8GB of data in 6GB of GPU RAM throws bad_alloc, I have read that sort_by_key consumed most of the memory for the test case considered therein.

Is there an alternative that can do exactly what sort_by_key is doing even if it is a little bit slower but that can sort bigger datasets?

Solution

I have been searching around a bit and I think the following is a candidate answer to your question.

Quoting N. Bell and J. Hoberock, "Thrust: a productivity-oriented library for CUDA", in GPU Computing Gems Jade Edition:

Thrust statically selects a highly-optimized Radix Sort algorithm for sorting primitive types (char, int, float and double) with the standard less and greater comparison operators. For all other types (e.g., user-defined data types) and comparison operators, Thrust uses a general Merge Sort algorithm. Because sorting primitives with Radix Sort is considerably faster than Merge Sort, this static optimization has significant value.

Now, Merge Sort requires O(N) memory space, see Space requirements of a merge-sort.

Furthermore, Radix Sort still requires O(N) memory space, see Comparison of Bucket Sort and RADIX Sort.

Which of the two consumes more memory is not defined and depends on the input sequence to be sorted as well as on algorithm tuning parameters, see the comments to one of the answers to Why quicksort is more popular than radix-sort?.

Opposite to that, Quick Sort requires O(logN) memory space if performed in-place, otherwise it needs O(N). For a CUDA implementation of the Quick Sort algorithm, you may have a look at How Tesla K20 Speeds Quicksort.

For other in-place sorting algorithms (the in-place strategy is worth to be explored as it saves memory as compared to the non-in-place counterpart), have a look at the Bitonic Sort, see Fast in-place sorting with CUDA based on bitonic sort.