Search code examples
c++multithreadingcudathrust

Is THRUST stable_sort_by_key O(n)?


Can I assume that Thrust stable_sort_by_key performed on unsigned int has complexity O(n)? If not what should I do to be sure that this complexity will be achieved? (Except of implementing radix sort on my own)


Solution

  • It depends a bit on your circumstances/view. Just from the docs/API there doesn't seem to be a guarantee for thrust::stable_sort_by_key on unsigned int keys using a radix sort.

    On the other hand the necessary algorithm cub::DeviceRadixSort::SortPairs is implemented in the CUB library which is used by Thrust in the backend and there is no good reason for Thrust not to use it as the prerequisites can easily be queried at compile time.

    From the code in thrust/system/cuda/detail/sort.h(the "detail" should warn you that this is not part of the public API) one can see that thrust::stable_sort_by_key can launch a cub::DeviceRadixSort::SortPairs under the right circumstances (arithmetic key type and using thrust::less or thrust::greater as comparison operation) at least on the main branch of Thrust at the time of writing (ca. v2.0.0rc2) but probably for a long time already. Else it will fall back to a merge sort.

    Directly using cub::DeviceRadixSort::SortPairs could have benefits even if this is enough for you, as this makes it easier to reuse temporary buffers and avoid unnecessary synchronization. Both can be done in Thrust via a

    auto exec = thrust::cuda::par_nosync(custom_allocator).on(custom_stream);
    

    execution policy (the most recent CUDA Toolkit 11.8 still comes with old Thrust v1.15 without v1.16 par_nosync). One thing that can not be avoided using Thrust is the in-place nature of the sorting algorithms which is achieved by potentially copying the results back to the input buffers. These copy operations can only be elided using CUB.