Search code examples
sortingvectorcudagputhrust

Sorting multiple arrays using CUDA/Thrust


I have a large array that I need to sort on the GPU. The array itself is a concatenation of multiple smaller subarrays that satisfy the condition that given i < j, the elements of the subarray i are smaller than the elements of the subarray j. An example of such array would be {5 3 4 2 1 6 9 8 7 10 11}, where the elements of the first subarray of 5 elements are smaller than the elements of the second subarray of 6 elements. The array I need is {1, 2, 3, 4, 5, 6, 7, 10, 11}. I know the position where each subarray starts in the large array.

I know I can simply use thrust::sort on the whole array, but I was wondering if it's possible to launch multiple concurrent sorts, one for each subarray. I'm hoping to get a performance improvement by doing that. My assumption is that it would be faster to sort multiple smaller arrays than one large array with all the elements.

I'd appreciate if someone could give me a way to do that or correct my assumption in case it's wrong.


Solution

  • A way to do multiple concurrent sorts (a "vectorized" sort) in thrust is via the marking of the sub arrays, and providing a custom functor that is an ordinary thrust sort functor that also orders the sub arrays by their key.

    Another possible method is to use back-to-back thrust::stable_sort_by_key as described here.

    As you have pointed out, another method in your case is just to do an ordinary sort, since that is ultimately your objective.

    However I think its unlikely that any of the thrust sort methods will give a signficant speed-up over a pure sort, although you can try it. Thrust has a fast-path radix sort which it will use in certain situations, which the pure sort method could probably use in your case. (In other cases, e.g. when you provide a custom functor, thrust will often use a slower merge-sort method.)

    If the sizes of the sub arrays are within certain ranges, I think you're likely to get much better results (performance-wise) with block radix sort in cub, one block per sub-array.

    Here is an example that uses specific sizes (since you've given no indication of size ranges and other details), comparing a thrust "pure sort" to a thrust segmented sort with functor, to the cub block sort method. For this particular case, the cub sort is fastest:

    $ cat t1.cu
    #include <thrust/device_vector.h>
    #include <thrust/host_vector.h>
    #include <thrust/sort.h>
    #include <thrust/scan.h>
    #include <thrust/equal.h>
    #include <cstdlib>
    #include <iostream>
    
    
    #include <time.h>
    #include <sys/time.h>
    #define USECPSEC 1000000ULL
    
    const int num_blocks = 2048;
    const int items_per = 4;
    const int nTPB = 512;
    const int block_size = items_per*nTPB; // must be a whole-number multiple of nTPB;
    typedef float mt;
    
    unsigned long long dtime_usec(unsigned long long start){
    
      timeval tv;
      gettimeofday(&tv, 0);
      return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
    }
    
    struct my_sort_functor
    {
            template <typename T, typename T2>
            __host__ __device__
            bool operator()(T t1, T2 t2){
                    if (thrust::get<1>(t1) < thrust::get<1>(t2)) return true;
                    if (thrust::get<1>(t1) > thrust::get<1>(t2)) return false;
                    if (thrust::get<0>(t1) > thrust::get<0>(t2)) return false;
                    return true;}
    };
    
    // from: https://nvlabs.github.io/cub/example_block_radix_sort_8cu-example.html#_a0
    #define CUB_STDERR
    #include <stdio.h>
    #include <iostream>
    #include <algorithm>
    #include <cub/block/block_load.cuh>
    #include <cub/block/block_store.cuh>
    #include <cub/block/block_radix_sort.cuh>
    using namespace cub;
    //---------------------------------------------------------------------
    // Globals, constants and typedefs
    //---------------------------------------------------------------------
    bool g_verbose = false;
    bool g_uniform_keys;
    //---------------------------------------------------------------------
    // Kernels
    //---------------------------------------------------------------------
    template <
        typename    Key,
        int         BLOCK_THREADS,
        int         ITEMS_PER_THREAD>
    __launch_bounds__ (BLOCK_THREADS)
    __global__ void BlockSortKernel(
        Key         *d_in,          // Tile of input
        Key         *d_out)         // Tile of output
    {
        enum { TILE_SIZE = BLOCK_THREADS * ITEMS_PER_THREAD };
        // Specialize BlockLoad type for our thread block (uses warp-striped loads for coalescing, then transposes in shared memory to a blocked arrangement)
        typedef BlockLoad<Key, BLOCK_THREADS, ITEMS_PER_THREAD, BLOCK_LOAD_WARP_TRANSPOSE> BlockLoadT;
        // Specialize BlockRadixSort type for our thread block
        typedef BlockRadixSort<Key, BLOCK_THREADS, ITEMS_PER_THREAD> BlockRadixSortT;
        // Shared memory
        __shared__ union TempStorage
        {
            typename BlockLoadT::TempStorage        load;
            typename BlockRadixSortT::TempStorage   sort;
        } temp_storage;
        // Per-thread tile items
        Key items[ITEMS_PER_THREAD];
        // Our current block's offset
        int block_offset = blockIdx.x * TILE_SIZE;
        // Load items into a blocked arrangement
        BlockLoadT(temp_storage.load).Load(d_in + block_offset, items);
        // Barrier for smem reuse
        __syncthreads();
        // Sort keys
        BlockRadixSortT(temp_storage.sort).SortBlockedToStriped(items);
        // Store output in striped fashion
        StoreDirectStriped<BLOCK_THREADS>(threadIdx.x, d_out + block_offset, items);
    }
    
    int main(){
            const int ds = num_blocks*block_size;
            thrust::host_vector<mt>      data(ds);
            thrust::host_vector<int>     keys(ds);
            for (int i = block_size; i < ds; i+=block_size) keys[i] = 1; // mark beginning of blocks
            thrust::device_vector<int> d_keys = keys;
            for (int i = 0; i < ds; i++) data[i] = (rand()%block_size) + (i/block_size)*block_size;  // populate data
            thrust::device_vector<mt>  d_data = data;
            thrust::inclusive_scan(d_keys.begin(), d_keys.end(), d_keys.begin());  // fill out keys array  000111222...
            thrust::device_vector<mt> d1 = d_data;  // make a copy of unsorted data
            cudaDeviceSynchronize();
            unsigned long long os = dtime_usec(0);
            thrust::sort(d1.begin(), d1.end());  // ordinary sort
            cudaDeviceSynchronize();
            os = dtime_usec(os);
            thrust::device_vector<mt> d2 = d_data;  // make a copy of unsorted data
            cudaDeviceSynchronize();
            unsigned long long ss = dtime_usec(0);
            thrust::sort(thrust::make_zip_iterator(thrust::make_tuple(d2.begin(), d_keys.begin())), thrust::make_zip_iterator(thrust::make_tuple(d2.end(), d_keys.end())), my_sort_functor());
            cudaDeviceSynchronize();
            ss = dtime_usec(ss);
            if (!thrust::equal(d1.begin(), d1.end(), d2.begin())) {std::cout << "oops1" << std::endl; return 0;}
            std::cout << "ordinary thrust sort: " << os/(float)USECPSEC << "s " << "segmented sort: " << ss/(float)USECPSEC << "s" << std::endl;
            thrust::device_vector<mt> d3(ds);
            cudaDeviceSynchronize();
            unsigned long long cs = dtime_usec(0);
            BlockSortKernel<mt, nTPB, items_per><<<num_blocks, nTPB>>>(thrust::raw_pointer_cast(d_data.data()),  thrust::raw_pointer_cast(d3.data()));
            cudaDeviceSynchronize();
            cs = dtime_usec(cs);
            if (!thrust::equal(d1.begin(), d1.end(), d3.begin())) {std::cout << "oops2" << std::endl; return 0;}
            std::cout << "cub sort: " << cs/(float)USECPSEC << "s" << std::endl;
    }
    $ nvcc -o t1 t1.cu
    $ ./t1
    ordinary thrust sort: 0.001652s segmented sort: 0.00263s
    cub sort: 0.000265s
    $
    

    (CUDA 10.2.89, Tesla V100, Ubuntu 18.04)

    I have no doubt that your sizes and array dimensions don't correspond to mine. The purpose here is to illustrate some possible methods, not a black-box solution that works for your particular case. You probably should do benchmark comparisons of your own. I also acknowledge that the block radix sort method for cub expects equal-sized sub-arrays, which you may not have. It may not be a suitable method for you, or you may wish to explore some kind of padding arrangement. There's no need to ask this question of me; I won't be able to answer it based on the information in your question.

    I don't claim correctness for this code or any other code that I post. Anyone using any code I post does so at their own risk. I merely claim that I have attempted to address the questions in the original posting, and provide some explanation thereof. I am not claiming my code is defect-free, or that it is suitable for any particular purpose. Use it (or not) at your own risk.