Tags: c++, parallel-processing, cuda, functor, thrust

CUDA Thrust Functor with Flexibility to Run in CPU or GPU


This might be a stupid question, but I can't seem to find any resources specifically related to it, so your opinion is appreciated.

Let's say I have some functor

struct AddOne {
    thrust::device_ptr<int> numbers;

    __device__
    void operator()(int i) {
        numbers[i] = numbers[i] + 1;
    }
};

that I can call like this:

AddOne addOneFunctor;
thrust::device_vector<int> idx(100), numbers(100);
addOneFunctor.numbers = numbers.data();
thrust::sequence(idx.begin(), idx.end(), 0);
thrust::for_each(thrust::device, idx.begin(), idx.end(), addOneFunctor);

Is it possible to write the above so that the execution policy can be changed at either compile time or, ideally, run time?

E.g. change the struct to

struct AddOne {
    thrust::pointer<int> numbers;

    __host__ __device__
    void operator()(int i) {
        numbers[i] = numbers[i] + 1;
    }
};

so it could be run like this:

AddOne addOneFunctor;
std::vector<int> idx(100), numbers(100);
addOneFunctor.numbers = numbers.data();
thrust::sequence(idx.begin(), idx.end(), 0);
thrust::for_each(thrust::cpp::par, idx.begin(), idx.end(), addOneFunctor);

The bottom line is: I would like to have a single code base where I can decide either to use thrust::device_vectors and run on the GPU (using the thrust::device execution policy), or to use some sort of host vector (such as std::vector) and run on the CPU (using thrust::cpp::par or a similar policy).

PS: I would like to avoid PGI for now.


Solution

  • Yes, it's possible, pretty much exactly as you describe.

    Here's a fully-worked example:

    $ cat t1205.cu
    #include <thrust/execution_policy.h>
    #include <thrust/for_each.h>
    #include <thrust/device_vector.h>
    #include <thrust/sequence.h>
    #include <iostream>
    #include <vector>
    
    struct AddOne {
        int *numbers;   // raw pointer: works for host or device memory,
                        // depending on where it was allocated
        template <typename T>
        __host__ __device__
        void operator()(T &i) {
            numbers[i] = numbers[i] + 1;
        }
    };
    
    int main(){
    
      AddOne addOneFunctor;
      // host path: std::vector data processed on the CPU via thrust::cpp::par
      std::vector<int> idx(100), numbers(100);
      addOneFunctor.numbers = thrust::raw_pointer_cast(numbers.data());
      thrust::sequence(idx.begin(), idx.end(), 0);
      thrust::for_each(thrust::cpp::par, idx.begin(), idx.end(), addOneFunctor);
      for (int i = 0; i < 5; i++)
        std::cout << numbers[i] << ",";
      std::cout << std::endl;
    
      // device path: the same functor, now over device_vector data on the GPU
      thrust::device_vector<int> didx(100), dnumbers(100);
      addOneFunctor.numbers = thrust::raw_pointer_cast(dnumbers.data());
      thrust::sequence(didx.begin(), didx.end(), 0);
      thrust::for_each(thrust::device, didx.begin(), didx.end(), addOneFunctor);
      for (int i = 0; i < 5; i++)
        std::cout << dnumbers[i] << ",";
      std::cout << std::endl;
    }
    $ nvcc -o t1205 t1205.cu
    $ ./t1205
    1,1,1,1,1,
    1,1,1,1,1,
    $
    

    Note that the algorithm used is thrust::sequence, not the thrust::seq execution policy.
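
    For reference, thrust::seq is the sequential execution policy, and it can be passed in the same policy slot as the others. A one-line sketch (mine, not part of the original example), reusing the host-side names from the example above:

    // thrust::seq forces sequential execution in the calling (host) thread
    thrust::for_each(thrust::seq, idx.begin(), idx.end(), addOneFunctor);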

    Tested using CUDA 8 RC.

    As @m.s. points out, the explicit use of execution policies on the algorithms in the code above is not necessary; you can remove them and it will work the same way. However, the formal use of execution policies allows the above example to be extended to the case where you are not using containers, but ordinary host and device data, so it may still have some value, depending on your overall goals. A sketch of that case follows.
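
    To illustrate that last point, here is a minimal sketch (mine, not part of the original answer) that drives the same functor over raw host and device data, and selects the policy with an ordinary run-time branch; the names N, on_gpu, h_*, and d_* are hypothetical:

    #include <thrust/execution_policy.h>
    #include <thrust/for_each.h>
    #include <thrust/sequence.h>
    #include <cuda_runtime.h>
    #include <iostream>

    struct AddOne {
        int *numbers;
        template <typename T>
        __host__ __device__
        void operator()(T &i) { numbers[i] = numbers[i] + 1; }
    };

    int main(){
      const int N = 100;
      bool on_gpu = true;   // run-time switch between GPU and CPU execution
      AddOne addOneFunctor;
      if (on_gpu){
        // raw device allocations; thrust::device tells Thrust these
        // raw pointers refer to device memory
        int *d_idx, *d_numbers;
        cudaMalloc(&d_idx, N*sizeof(int));
        cudaMalloc(&d_numbers, N*sizeof(int));
        cudaMemset(d_numbers, 0, N*sizeof(int));
        thrust::sequence(thrust::device, d_idx, d_idx+N, 0);
        addOneFunctor.numbers = d_numbers;
        thrust::for_each(thrust::device, d_idx, d_idx+N, addOneFunctor);
        int result;
        cudaMemcpy(&result, d_numbers, sizeof(int), cudaMemcpyDeviceToHost);
        std::cout << result << std::endl;       // prints 1
        cudaFree(d_idx); cudaFree(d_numbers);
      } else {
        // plain host arrays; thrust::host runs the same functor on the CPU
        int h_idx[N], h_numbers[N] = {0};
        thrust::sequence(thrust::host, h_idx, h_idx+N, 0);
        addOneFunctor.numbers = h_numbers;
        thrust::for_each(thrust::host, h_idx, h_idx+N, addOneFunctor);
        std::cout << h_numbers[0] << std::endl; // prints 1
      }
    }

    Since both branches are compiled into the one binary, this gives the run-time selection the question asks about; the remaining burden is allocating the data on whichever side you intend to run.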