c++ · cuda · thrust

How to use thrust to accumulate array based on index?


I am trying to accumulate values in an array based on an index. My inputs are two vectors of the same length: the first vector holds the indices (keys), and the second holds the values. My goal is to sum the values that share the same key. I have similar code in C++, but I am new to Thrust. Can I do this with Thrust device code, and which function should I use? I found no "map"-like functions. Would it be more efficient than the CPU (host) code? My C++ mini sample:

int a[10]={1,2,3,4,5,1,1,3,4,4};     // keys
vector<int> key(a,a+10);
double b[10]={1,2,3,4,5,1,2,3,4,5};  // values
vector<double> val(b,b+10);

// accumulate each value into the bucket for its key
unordered_map<size_t,double> M;
for (size_t i = 0; i < 10; i++)
{
    M[key[i]] = M[key[i]] + val[i];
}

Solution

  • As indicated in the comment, the canonical way to do this is to reorder the data (keys, values) so that like keys are grouped together. You can do this with sort_by_key; reduce_by_key then produces the per-key sums.

    It is also possible, in a slightly un-thrust-like way, to solve the problem without reordering, using a functor passed to for_each that performs an atomic add.

    The following illustrates both:

    $ cat t27.cu
    #include <thrust/reduce.h>
    #include <thrust/sort.h>
    #include <thrust/device_vector.h>
    #include <thrust/host_vector.h>
    #include <thrust/iterator/zip_iterator.h>
    #include <thrust/for_each.h>
    #include <thrust/fill.h>
    #include <thrust/copy.h>
    #include <iostream>
    #include <iterator>
    #include <unordered_map>
    #include <vector>
    
    // this functor only needed for the non-reordering case
    // requires compilation for a cc6.0 or higher GPU e.g. -arch=sm_60
    struct my_func {
      double *r;
      my_func(double *_r) : r(_r) {};
      template <typename T>
      __device__   // atomicAdd(double*, double) is device-only
      void operator()(T t) {
        atomicAdd(r+thrust::get<0>(t)-1, thrust::get<1>(t));  // assumes consecutive keys starting at 1
      }
    };
    
    int main(){
    
      int a[10]={1,2,3,4,5,1,1,3,4,4};
      std::vector<int> key(a,a+10);
      double b[10]={1,2,3,4,5,1,2,3,4,5};
      std::vector<double> val(b,b+10);
    
      std::unordered_map<size_t,double> M;
      for (size_t i = 0;i< 10 ;i++)
      {
        M[key[i]] = M[key[i]]+val[i];
      }
      for (int i = 1; i < 6; i++) std::cout << M[i] << " ";
      std::cout << std::endl;
      int size_a = sizeof(a)/sizeof(a[0]);
      thrust::device_vector<int>    d_a(a, a+size_a);
      thrust::device_vector<double> d_b(b, b+size_a);
      thrust::device_vector<double> d_r(5); //assumes only 5 keys, for illustration
      thrust::device_vector<int> d_k(5); // assumes only 5 keys, for illustration
      // method 1, without reordering
      thrust::for_each_n(thrust::make_zip_iterator(thrust::make_tuple(d_a.begin(), d_b.begin())), size_a, my_func(thrust::raw_pointer_cast(d_r.data())));
      thrust::host_vector<double> r = d_r;
      thrust::copy(r.begin(), r.end(), std::ostream_iterator<double>(std::cout, " "));
      std::cout << std::endl;
      thrust::fill(d_r.begin(), d_r.end(), 0.0);
      // method 2, with reordering
      thrust::sort_by_key(d_a.begin(), d_a.end(), d_b.begin());
      thrust::reduce_by_key(d_a.begin(), d_a.end(), d_b.begin(), d_k.begin(), d_r.begin());
      thrust::copy(d_r.begin(), d_r.end(), r.begin());
      thrust::copy(r.begin(), r.end(), std::ostream_iterator<double>(std::cout, " "));
      std::cout << std::endl;
    }
    $ nvcc -o t27 t27.cu -std=c++14 -arch=sm_70
    $ ./t27
    4 2 6 13 5
    4 2 6 13 5
    4 2 6 13 5
    $
    

    I make no claims about the relative performance of these approaches. It would likely depend on the actual data set size, and possibly on the GPU being used and other factors.
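
    For reference, the reordering pipeline (sort_by_key followed by reduce_by_key) can be sketched in plain host C++. This is only a host-side analogue to show the idea, not the Thrust implementation; `sum_by_key` is a hypothetical helper name:

    ```cpp
    #include <algorithm>
    #include <iostream>
    #include <utility>
    #include <vector>

    // Host-side analogue of sort_by_key + reduce_by_key:
    // returns (unique keys, per-key sums)
    std::pair<std::vector<int>, std::vector<double>>
    sum_by_key(const std::vector<int>& key, const std::vector<double>& val) {
      // "sort_by_key": sort the (key, value) pairs by key
      std::vector<std::pair<int,double>> kv;
      for (size_t i = 0; i < key.size(); i++) kv.push_back({key[i], val[i]});
      std::stable_sort(kv.begin(), kv.end(),
                       [](const auto& x, const auto& y){ return x.first < y.first; });
      // "reduce_by_key": sum each run of equal keys
      std::vector<int>    out_key;
      std::vector<double> out_val;
      for (const auto& p : kv) {
        if (out_key.empty() || out_key.back() != p.first) {
          out_key.push_back(p.first);   // new key starts a new output slot
          out_val.push_back(p.second);
        } else {
          out_val.back() += p.second;   // same key: accumulate into current slot
        }
      }
      return {out_key, out_val};
    }

    int main() {
      auto [k, v] = sum_by_key({1,2,3,4,5,1,1,3,4,4}, {1,2,3,4,5,1,2,3,4,5});
      for (double x : v) std::cout << x << " ";   // prints: 4 2 6 13 5
      std::cout << std::endl;
    }
    ```

    Unlike the unordered_map version, this produces the results in sorted key order, which matches what the Thrust reduce_by_key output above looks like.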