Search code examples
sortingcudathrust

How to partly sort arrays on CUDA?


Problem

Provided I have two arrays:

   const int N = 1000000;
   float A[N];
   myStruct *B[N];

The numbers in A can be positive or negative (e.g. A[N]={3,2,-1,0,5,-2}), how can I make the array A partly sorted (all positive values first, not need to be sorted, then negative values)(e.g. A[N]={3,2,5,0,-1,-2} or A[N]={5,2,3,0,-2,-1}) on the GPU? The array B should be changed according to A (A is keys, B is values).

Since the scale of A,B can be very large, I think the sort algorithm should be implemented on GPU (especially on CUDA, because I use this platform). Surely I know thrust::sort_by_key can do this work, but it does muck extra work since I do not need the array A&B to be sorted entirely.

Has anyone come across this kind of problem?

Thrust example

thrust::sort_by_key(thrust::device_ptr<float> (A), 
            thrust::device_ptr<float> ( A + N ),  
            thrust::device_ptr<myStruct> ( B ),  
            thrust::greater<float>() );

Solution

  • Thrust's documentation on Github is not up-to-date. As @JaredHoberock said, thrust::partition is the way to go since it now supports stencils. You may need to get a copy from the Github repository:

    git clone git://github.com/thrust/thrust.git

    Then run scons doc in the Thrust folder to get an updated documentation, and use these updated Thrust sources when compiling your code (nvcc -I/path/to/thrust ...). With the new stencil partition, you can do:

    #include <thrust/partition.h>
    #include <thrust/execution_policy.h>
    #include <thrust/iterator/zip_iterator.h>
    #include <thrust/tuple.h>
    
    struct is_positive
    {
    __host__ __device__
    bool operator()(const int &x)
    {
      return x >= 0;
    }
    };
    
    
    thrust::partition(thrust::host, // if you want to test on the host
                      thrust::make_zip_iterator(thrust::make_tuple(keyVec.begin(), valVec.begin())),
                      thrust::make_zip_iterator(thrust::make_tuple(keyVec.end(), valVec.end())),
                      keyVec.begin(),
                      is_positive());
    

    This returns:

    Before:
      keyVec =   0  -1   2  -3   4  -5   6  -7   8  -9
      valVec =   0   1   2   3   4   5   6   7   8   9
    After:
      keyVec =   0   2   4   6   8  -5  -3  -7  -1  -9
      valVec =   0   2   4   6   8   5   3   7   1   9
    

    Note that the 2 partitions are not necessarily sorted. Also, the order may differ between the original vectors and the partitions. If this is important to you, you can use thrust::stable_partition:

    stable_partition differs from partition in that stable_partition is guaranteed to preserve relative order. That is, if x and y are elements in [first, last), such that pred(x) == pred(y), and if x precedes y, then it will still be true after stable_partition that x precedes y.

    If you want a complete example, here it is:

    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>
    #include <thrust/partition.h>
    #include <thrust/iterator/zip_iterator.h>
    #include <thrust/tuple.h>
    
    struct is_positive
    {
    __host__ __device__
    bool operator()(const int &x)
    {
      return x >= 0;
    }
    };
    
    void print_vec(const thrust::host_vector<int>& v)
    {
      for(size_t i = 0; i < v.size(); i++)
        std::cout << "  " << v[i];
      std::cout << "\n";
    }
    
    int main ()
    {
      const int N = 10;
    
      thrust::host_vector<int> keyVec(N);
      thrust::host_vector<int> valVec(N);
    
      int sign = 1;
      for(int i = 0; i < N; ++i)
      {
        keyVec[i] = sign * i;
        valVec[i] = i;
        sign *= -1;
      }
    
      // Copy host to device
      thrust::device_vector<int> d_keyVec = keyVec;
      thrust::device_vector<int> d_valVec = valVec;
    
      std::cout << "Before:\n  keyVec = ";
      print_vec(keyVec);
      std::cout << "  valVec = ";
      print_vec(valVec);
    
      // Partition key-val on device
      thrust::partition(thrust::make_zip_iterator(thrust::make_tuple(d_keyVec.begin(), d_valVec.begin())),
                        thrust::make_zip_iterator(thrust::make_tuple(d_keyVec.end(), d_valVec.end())),
                        d_keyVec.begin(),
                        is_positive());
    
      // Copy result back to host
      keyVec = d_keyVec;
      valVec = d_valVec;
    
      std::cout << "After:\n  keyVec = ";
      print_vec(keyVec);
      std::cout << "  valVec = ";
      print_vec(valVec);
    }
    

    UPDATE

    I made a quick comparison with the thrust::sort_by_key version, and the thrust::partition implementation does seem to be faster (which is what we could naturally expect). Here is what I obtain on NVIDIA Visual Profiler, with N = 1024 * 1024, with the sort version on the left, and the partition version on the right. You may want to do the same kind of tests on your own.

    Sort vs Partition