Search code examples
cudagputhrustreduction

CUDA Thrust: reduce_by_key on only some values in an array, based off values in a "key" array


Let's say I have two device_vector<byte> arrays, d_keys and d_data.

If d_data is, for example, a flattened 2D 3x5 array ( e.g. { 1, 2, 3, 4, 5, 6, 7, 8, 9, 8, 7, 6, 5, 4, 3 } ) and d_keys is a 1D array of size 5 ( e.g. { 1, 0, 0, 1, 1 } ), how can I do a reduction such that I'd end up only adding values on a per-row basis if the corresponding d_keys value is one ( e.g. ending up with a result of { 10, 23, 14 } )?

The sum_rows.cu example allows me to add every value in d_data, but that's not quite right.

Alternatively, I can, on a per-row basis, use a zip_iterator and combine d_keys with one row of d_data at a time, and do a transform_reduce, adding only if the key value is one, but then I'd have to loop through the d_data array.

What I really need is some sort of transform_reduce_by_key functionality that isn't built-in, but surely there must be a way to make it!


Solution

  • Based on the additional comment that instead of 3 rows there are thousands of rows, we can write a transform functor that sums an entire row. Based on the fact that there are thousands of rows, this should keep the machine pretty busy:

    #include <iostream>
    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>
    #include <thrust/transform.h>
    #include <thrust/sequence.h>
    #include <thrust/fill.h>
    
    #define ROW   20
    #define COL   10
    
    __device__ int *vals;
    __device__ int *keys;
    
    struct test_functor
    {
      const int a;
    
      test_functor(int _a) : a(_a) {}
    
      __device__
      int operator()(int& x, int& y ) {
        int temp = 0;
        for (int i = 0; i<a; i++)
          temp += vals[i + (y*a)] * keys[i];
        return temp;
        }
    };
    
    int main(){
      int *s_vals, *s_keys;
      thrust::host_vector<int> h_vals(ROW*COL);
      thrust::host_vector<int> h_keys(COL);
      thrust::sequence(h_vals.begin(), h_vals.end());
      thrust::fill(h_keys.begin(), h_keys.end(), 1);
      h_keys[0] = 0;
      thrust::device_vector<int> d_vals = h_vals;
      thrust::device_vector<int> d_keys = h_keys;
      thrust::device_vector<int> d_sums(ROW);
      thrust::fill(d_sums.begin(), d_sums.end(), 0);
      s_vals = thrust::raw_pointer_cast(&d_vals[0]);
      s_keys = thrust::raw_pointer_cast(&d_keys[0]);
      cudaMemcpyToSymbol(vals, &s_vals, sizeof(int *));
      cudaMemcpyToSymbol(keys, &s_keys, sizeof(int *));
      thrust::device_vector<int> d_idx(ROW);
      thrust::sequence(d_idx.begin(), d_idx.end());
      thrust::transform(d_sums.begin(), d_sums.end(), d_idx.begin(),  d_sums.begin(), test_functor(COL));
      thrust::host_vector<int> h_sums = d_sums;
      std::cout << "Results :" << std::endl;
      for (unsigned i = 0; i<ROW; i++)
        std::cout<<"h_sums["<<i<<"] = " << h_sums[i] << std::endl;
      return 0;
    }
    

    This approach has the drawback that in general accesses to the vals array will not be coalesced. However for a few thousand rows the cache may offer significant relief. We can fix this problem by re-ordering the data to be stored in column-major form in the flattened array, and change our indexing method in the loop in the functor to be like this:

    for (int i=0; i<a; i++)
      temp += vals[(i*ROW)+y]*keys[i];
    

    If preferred, you can pass ROW as an additional parameter to the functor.