Search code examples
cudathrustdynamic-parallelism

How to call a Thrust function in a stream from a kernel?


I want to make thrust::scatter asynchronous by calling it in a device kernel(I could also do it by calling it in another host thread). thrust::cuda::par.on(stream) is host function that cannot be called from a device kernel. The following code was tried with CUDA 10.1 on Turing architecture.


__global__ void async_scatter_kernel(float* first,
    float* last,
    int* map,
    float* output)
{
    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
    thrust::scatter(thrust::cuda::par.on(stream), first, last, map, output);
    cudaDeviceSynchronize();
    cudaStreamDestroy(stream);
}

I know thrust uses dynamic parallelism to launch its kernels when called from the device, however I couldn't find a way to specify the stream.


Solution

  • The following code compiles cleanly for me on CUDA 10.1.243:

    $ cat t1518.cu
    #include <thrust/scatter.h>
    #include <thrust/execution_policy.h>
    
    __global__ void async_scatter_kernel(float* first,
        float* last,
        int* map,
        float* output)
    {
        cudaStream_t stream;
        cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
        thrust::scatter(thrust::cuda::par.on(stream), first, last, map, output);
        cudaDeviceSynchronize();
        cudaStreamDestroy(stream);
    }
    
    int main(){
    
      float *first = NULL;
      float *last = NULL;
      float *output = NULL;
      int *map = NULL;
      async_scatter_kernel<<<1,1>>>(first, last, map, output);
      cudaDeviceSynchronize();
    }
    $ nvcc -arch=sm_35 -rdc=true t1518.cu -o t1518
    $ nvcc --version
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2019 NVIDIA Corporation
    Built on Sun_Jul_28_19:07:16_PDT_2019
    Cuda compilation tools, release 10.1, V10.1.243
    $
    

    The -arch=sm_35 (or similar) and -rdc=true are necessary (but not in all cases sufficient) compile switches for any code that uses CUDA Dynamic Parallelism. If you omit, for example, the -rdc=true switch, you get an error similar to what you describe:

    $ nvcc -arch=sm_35 t1518.cu -o t1518
    t1518.cu(11): error: calling a __host__ function("thrust::cuda_cub::par_t::on const") from a __global__ function("async_scatter_kernel") is not allowed
    
    t1518.cu(11): error: identifier "thrust::cuda_cub::par_t::on const" is undefined in device code
    
    2 errors detected in the compilation of "/tmp/tmpxft_00003a80_00000000-8_t1518.cpp1.ii".
    $
    

    So, for the example you have shown here, your compilation error can be eliminated either by updating to the latest CUDA version or by specifying the proper command line, or both.