Search code examples
c++cudagputhrust

Calling thrust function inside a CUDA Kernel __global___


I've read about that the dynamic parallelism is supported in the newer version of CUDA, and I can call thrust functions like thrush::exclusive_scan inside a kernel function with thrust::device parameter.

__global__ void kernel(int* inarray, int n, int *result) {
  extern __shared__ int s[];
  int t = threadIdx.x;

  s[t] = inarray[t];
  __syncthreads();

  thrust::exclusive_scan(thrust::device, s, n, result);
  __syncthreads();
}

int main() {
  // prep work

  kernel<<<1, n, n * sizeof(int)>>>(inarray, n, result);
}

The thing I got confused is:

  1. When calling thrust function inside a kernel, does each thread call the function once and they all do a dynamic parallelism on the data?
  2. If they do, I only need one thread to call thrust so I can just do a if to threadIdx; if not, how do threads in a block communicate with each other that the call to thrust has been done and they should just ignore it(this seems a little imaginary since there wouldn't be a systematical way to ensure from user's code). To summerize, what's exactly happening when I call thrust functions with thrust::device parameter inside a kernel?

Solution

    1. Every thread in your kernel that executes the thrust algorithm will execute a separate copy of your algorithm. The threads in your kernel do not cooperate on a single algorithm call.

    2. If you have met all the requirements (HW/SW and compilation settings) for a CUDA dynamic parallelism (CDP) call, then each thread that encounters the thrust algorithm call will launch a CDP child kernel to perform the thrust algorithm (in that case, the threads in the CDP child kernel do cooperate). If not, each thread that encounters the thrust algorithm call will perform it as if you had specified thrust::seq instead of thrust::device.

    3. If you prefer to avoid the CDP activity in an otherwise CDP-capable environment, you can specify thrust::seq instead.

    4. If you intend, for example, that only a single copy of your thrust algorithm be executed, it will be necessary in your kernel code to ensure that only one thread calls it, for example:

      if (!threadIdx.x) thrust::exclusive_scan(...  
      

      or similar.

    5. Questions around synchronization before/after the call are no different from ordinary CUDA code. If you need all threads in the block to wait for the thrust algorithm to complete, use e.g. __syncthreads(), (and cudaDeviceSynchronize() in the CDP case).

    The information here may possibly be of interest as well.