I'm new to CUDA. I want to copy and sum values in a device_vector in the following ways.
thrust::device_vector<int> device_vectorA(5);
thrust::device_vector<int> device_vectorB(20);
1. Copy device_vectorA 4 times into device_vectorB in the following way:
for (size_t i = 0; i < 4; i++)
{
size_t offset_sta = i * 5; // start of the i-th copy inside device_vectorB
thrust::copy(device_vectorA.begin(), device_vectorA.end(), device_vectorB.begin() + offset_sta);
}
2. Sum every 5 values in device_vectorB and store the results in a new device_vector (size 4):
// Example
device_vectorB = 1 2 3 4 5 | 1 2 3 4 5 | 1 2 3 4 5 | 1 2 3 4 5
device_vectorC = 15 15 15 15
thrust::device_vector<int> device_vectorC(4);
for (size_t i = 0; i < 4; i++)
{
size_t offset_sta = i * 5;
size_t offset_end = (i + 1) * 5; // one past the last element: thrust::reduce takes a half-open range [first, last)
device_vectorC[i] = thrust::reduce(device_vectorB.begin() + offset_sta, device_vectorB.begin() + offset_end, 0);
}
Are there more efficient ways (or functions provided by thrust) to implement these?
P.S. 1 and 2 are separate instances. For simplicity, these two instances just use the same vectors to illustrate.
Step 1 can be done with a single thrust::copy operation using a permutation iterator that uses a transform iterator working on a counting iterator to generate the copy indices "on the fly".
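To make the index mapping concrete, here is a minimal host-side sketch in plain C++ (not Thrust) of what that single copy computes. The function name `tile_reference` is hypothetical and used only for illustration; the point is that output element i is read from input element i % A.size(), which is the index the transform iterator produces for the permutation iterator.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference for step 1 (illustrative only, not the Thrust code).
// Output element i is read from a[i % a.size()], the same index that the
// transform iterator hands to the permutation iterator.
std::vector<int> tile_reference(const std::vector<int>& a, std::size_t out_size) {
    assert(!a.empty());  // the modulo below needs a non-empty input
    std::vector<int> out(out_size);
    for (std::size_t i = 0; i < out_size; ++i) {
        out[i] = a[i % a.size()];
    }
    return out;
}
```

Calling `tile_reference({1, 2, 3, 4, 5}, 20)` yields the input repeated four times. On the device, every output element can be produced independently, which is why one copy call can replace the loop of four copies.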
Step 2 is a partitioned reduction, using thrust::reduce_by_key. We can again use a transform iterator working on a counting iterator to create the flags array "on the fly".
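As a rough host-side sketch of the same idea in plain C++ (not Thrust), the key for element i is i / segment_size, and all elements sharing a key are summed into one output. The name `segmented_sum_reference` is hypothetical, and the sketch assumes the input length is an exact multiple of the segment size.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference for step 2 (illustrative only, not the Thrust code).
// Element i belongs to segment i / segment_size; this is the key that the
// transform iterator generates for reduce_by_key.
std::vector<int> segmented_sum_reference(const std::vector<int>& b,
                                         std::size_t segment_size) {
    // Assumption: b splits evenly into segments.
    assert(segment_size > 0 && b.size() % segment_size == 0);
    std::vector<int> sums(b.size() / segment_size, 0);
    for (std::size_t i = 0; i < b.size(); ++i) {
        sums[i / segment_size] += b[i];
    }
    return sums;
}
```

With twenty values forming four runs of 1 2 3 4 5 and a segment size of 5, this returns four sums of 15 each. reduce_by_key reduces runs of consecutive equal keys, which matches here because the keys i / segment_size are non-decreasing.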
Here is an example:
$ cat t2124.cu
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/copy.h>
#include <thrust/reduce.h>
#include <thrust/sequence.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/discard_iterator.h>
#include <iostream>
#include <iterator>
using namespace thrust::placeholders;
const int As = 5;
const int Cs = 4;
const int Bs = As*Cs;
int main(){
thrust::device_vector<int> A(As);
thrust::device_vector<int> B(Bs);
thrust::device_vector<int> C(Cs);
thrust::sequence(A.begin(), A.end(), 1); // fill A with 1,2,3,4,5
thrust::copy_n(thrust::make_permutation_iterator(A.begin(), thrust::make_transform_iterator(thrust::counting_iterator<int>(0), _1%A.size())), B.size(), B.begin()); // step 1
auto my_flags_iterator = thrust::make_transform_iterator(thrust::counting_iterator<int>(0), _1/A.size());
thrust::reduce_by_key(my_flags_iterator, my_flags_iterator+B.size(), B.begin(), thrust::make_discard_iterator(), C.begin()); // step 2
thrust::host_vector<int> Ch = C;
thrust::copy_n(Ch.begin(), Ch.size(), std::ostream_iterator<int>(std::cout, ","));
std::cout << std::endl;
}
$ nvcc -o t2124 t2124.cu
$ compute-sanitizer ./t2124
========= COMPUTE-SANITIZER
15,15,15,15,
========= ERROR SUMMARY: 0 errors
$
If we wanted to, even the device vector A could be dispensed with; its values could be generated "on the fly" using a counting iterator. But presumably your inputs are not actually 1,2,3,4,5.
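For illustration only, here is a host-side C++ sketch of that last idea, assuming the inputs really are 1, 2, ..., n. The value at tiled position i is then (i % n) + 1, so neither A nor B needs to be stored. The names `tiled_value` and `segment_sums_without_a` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Value that tiled position i would hold if A were 1, 2, ..., n.
// Assumption: the inputs are exactly this sequence.
int tiled_value(std::size_t i, std::size_t n) {
    assert(n > 0);
    return static_cast<int>(i % n) + 1;
}

// Per-segment sums computed directly from generated values,
// without materializing A or the tiled vector B.
std::vector<int> segment_sums_without_a(std::size_t n, std::size_t segments) {
    std::vector<int> sums(segments, 0);
    for (std::size_t i = 0; i < n * segments; ++i) {
        sums[i / n] += tiled_value(i, n);
    }
    return sums;
}
```

For n = 5 and 4 segments this gives the same four sums of 15 as the program above. In Thrust, the analogous move would be a transform iterator over a counting iterator standing in for A.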