CUDA thrust vector: copy and sum the values from device_vectorA to device_vectorB


I'm new to CUDA.

I want to copy and sum values in a device_vector in the following two ways. Are there more efficient ways (or functions provided by thrust) to implement these?

thrust::device_vector<int> device_vectorA(5);
thrust::device_vector<int> device_vectorB(20);
  1. Copy device_vectorA 4 times into device_vectorB in the following way:
for (size_t i = 0; i < 4; i++)
    {
        // start of the i-th 5-element block in device_vectorB
        size_t offset_sta = i * 5;

        thrust::copy(device_vectorA.begin(), device_vectorA.end(), device_vectorB.begin() + offset_sta);
    }
  2. Sum every 5 values in device_vectorB and store the results in a new device_vector (size 4):
// Example
device_vectorB = 1 2 3 4 5 | 1 2 3 4 5 | 1 2 3 4 5 | 1 2 3 4 5 
device_vectorC =     15          15           15           15
thrust::device_vector<int> device_vectorC(4);

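for (size_t i = 0; i < 4; i++)
    {
        size_t offset_sta = i * 5;
        // thrust::reduce takes a half-open range [first, last), so the end
        // offset is one past the last element of the block (no "- 1")
        size_t offset_end = (i + 1) * 5;

        device_vectorC[i] = thrust::reduce(device_vectorB.begin() + offset_sta, device_vectorB.begin() + offset_end, 0);
    }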

P.S. 1 and 2 are separate instances. For simplicity, these two instances just use the same vectors to illustrate.


Solution

  • Step 1 can be done with a single thrust::copy operation using a permutation iterator that uses a transform iterator working on a counting iterator to generate the copy indices "on the fly".

    Step 2 is a partitioned reduction, using thrust::reduce_by_key. We can again use a transform iterator working on a counting iterator to create the flags array "on the fly".

    Here is an example:

    $ cat t2124.cu
    #include <thrust/device_vector.h>
    #include <thrust/host_vector.h>
    #include <thrust/copy.h>
    #include <thrust/reduce.h>
    #include <thrust/sequence.h>
    #include <thrust/iterator/permutation_iterator.h>
    #include <thrust/iterator/transform_iterator.h>
    #include <thrust/iterator/counting_iterator.h>
    #include <thrust/iterator/discard_iterator.h>
    #include <iostream>
    #include <iterator>   // std::ostream_iterator
    
    using namespace thrust::placeholders;
    const int As = 5;
    const int Cs = 4;
    const int Bs = As*Cs;
    int main(){
    
      thrust::device_vector<int> A(As);
      thrust::device_vector<int> B(Bs);
      thrust::device_vector<int> C(Cs);
      thrust::sequence(A.begin(), A.end(), 1);  // fill A with 1,2,3,4,5
      thrust::copy_n(thrust::make_permutation_iterator(A.begin(), thrust::make_transform_iterator(thrust::counting_iterator<int>(0), _1%A.size())), B.size(), B.begin());  // step 1
      auto my_flags_iterator = thrust::make_transform_iterator(thrust::counting_iterator<int>(0), _1/A.size());
      thrust::reduce_by_key(my_flags_iterator, my_flags_iterator+B.size(), B.begin(), thrust::make_discard_iterator(), C.begin()); // step 2
      thrust::host_vector<int> Ch = C;
      thrust::copy_n(Ch.begin(), Ch.size(), std::ostream_iterator<int>(std::cout, ","));
      std::cout << std::endl;
    }
    $ nvcc -o t2124 t2124.cu
    $ compute-sanitizer ./t2124
    ========= COMPUTE-SANITIZER
    15,15,15,15,
    ========= ERROR SUMMARY: 0 errors
    $
    

    If we wanted to, we could dispense with even the device vector A; its contents could be generated "on the fly" using a counting iterator. But presumably your inputs are not actually 1,2,3,4,5.
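
To make the index arithmetic behind the two fancy iterators easier to see, here is a minimal host-side sketch in plain C++ (no Thrust). The function names `tile` and `segment_sums` are hypothetical and used only for illustration. The sketch assumes a non-empty input and a non-zero segment size. It is not how Thrust runs the work on the device; it only mirrors the `_1 % A.size()` gather indices of step 1 and the `_1 / A.size()` keys of step 2.

```cpp
#include <cstddef>
#include <vector>

// Host-side analogue of step 1: out[i] = a[i % a.size()].
// This is the gather that the permutation iterator performs when its
// index iterator yields `_1 % A.size()`. Assumes a is non-empty.
std::vector<int> tile(const std::vector<int>& a, std::size_t copies) {
    std::vector<int> out(a.size() * copies);
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = a[i % a.size()];
    }
    return out;
}

// Host-side analogue of step 2: element i belongs to segment i / segment_size,
// which is the key that `_1 / A.size()` produces for reduce_by_key.
// Assumes segment_size > 0. Unlike reduce_by_key, this sketch ignores any
// trailing elements that do not fill a whole segment.
std::vector<int> segment_sums(const std::vector<int>& b, std::size_t segment_size) {
    std::vector<int> sums(b.size() / segment_size, 0);
    for (std::size_t i = 0; i < sums.size() * segment_size; ++i) {
        sums[i / segment_size] += b[i];
    }
    return sums;
}
```

With `a = {1, 2, 3, 4, 5}`, `tile(a, 4)` produces the 20-element vector from the question, and `segment_sums(tile(a, 4), 5)` produces `{15, 15, 15, 15}`, matching the output of the Thrust program above.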