Reduce multiple blocks of equal length that are arranged in a big vector Using CUDA

I am looking for a fast way to reduce multiple blocks of equal length that are arranged as a big vector. I have N subarrays(contiguous elements) that are arranged in one big array. each sub array has a fixed size : k. so the size of the whole array is : N*K

What I'm doing is to call the kernel N times. in each time it computes the reduction of the subarray as follow: I will iterate over all the subarrays contained in the big vector :

    for(i=0;i<N;i++){
       thrust::device_vector< float > Vec(subarray, subarray+k);
       float sum = thrust::reduce(Vec.begin(), Vec.end(), (float)0, thrust::plus<float>());
       printf("sum %f\n",sum);
 }

for pure CUDA i will do it like this (pseudo code):

 for(i=0;i<N;i++){
        reduction_kernel(subarray)

         }

do you have another solution to perform the reduction of the contiguous subarrays in once? using pure CUDA or Thrust

Solution

What you're asking for is a segmented reduction. This can be done in thrust using thrust::reduce_by_key In addition to your data vector of length N*K, we will need a "key" vector that defines each segment -- the segments don't have to be the same size, as long as the key vector differentiates segments like so:

data:  1 3 2 3 1 4 2 3 2 1 4 2 ...
keys:  0 0 0 1 1 1 0 0 0 3 3 3 ...
seg:   0 0 0 1 1 1 2 2 2 3 3 3 ...

The keys delineate a new segment any time the key sequence changes (note that I have two separate segments in the above example that are delineated using the same key - thrust doesn't group such segments together but treats them separately because there are 1 or more intervening key values that are different). You don't actually have this data, but for speed and efficiency, since your segments are of equal length, we can produce the necessary key sequence "on the fly" using a combination of thrust fancy iterators.

The fancy iterators will combine to:

produce a linear sequence 0 1 2 3 ... (via counting_iterator)
divide each member of the linear sequence by K, the segment length (via transform_iterator). I'm using thrust placeholder methodology here so I don't have to write a functor for the transform iterator.

This will produce the necessary segment-key sequence.

Here is a worked example:

$ cat t1282.cu
#include <thrust/reduce.h>
#include <thrust/device_vector.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/discard_iterator.h>
#include <thrust/copy.h>
#include <thrust/execution_policy.h>
#include <iostream>

const int N = 1000;  // sequences
const int K = 100;   // length of sequence
typedef int mytype;

using namespace thrust::placeholders;

int main(){

  thrust::device_vector<mytype> data(N*K, 1);
  thrust::device_vector<mytype> sums(N);
  thrust::reduce_by_key(thrust::device, thrust::make_transform_iterator(thrust::counting_iterator<int>(0), _1/K), thrust::make_transform_iterator(thrust::counting_iterator<int>(N*K), _1/K), data.begin(), thrust::discard_iterator<int>(), sums.begin());
  // just display the first 10 results
  thrust::copy_n(sums.begin(), 10, std::ostream_iterator<mytype>(std::cout, ","));
  std::cout << std::endl;
}

$ nvcc -arch=sm_35 -o t1282 t1282.cu
$ ./t1282
100,100,100,100,100,100,100,100,100,100,
$