How to use child kernels (CUDA dynamic parallelism) using PyCUDA


My Python code has a GPU kernel function which is called multiple times in a for loop from the host, like this:

    for i in range(n):
        gpu_kernel_func(block=blocksize, grid=grid)
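
A stripped-down, runnable version of this pattern (the kernel here is just a placeholder for my real computation) looks like this:

    import numpy as np
    import pycuda.autoinit
    import pycuda.driver as drv
    from pycuda.compiler import SourceModule

    # placeholder kernel standing in for the real computation
    mod = SourceModule("""
    __global__ void gpu_kernel_func(float *data)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        data[idx] += 1.0f;
    }
    """)
    gpu_kernel_func = mod.get_function("gpu_kernel_func")

    data = np.zeros(1024, dtype=np.float32)
    data_gpu = drv.mem_alloc(data.nbytes)
    drv.memcpy_htod(data_gpu, data)

    # one kernel launch per loop iteration -- each launch is a separate
    # host <-> device interaction
    for i in range(100):
        gpu_kernel_func(data_gpu, block=(256, 1, 1), grid=(4, 1))

    drv.memcpy_dtoh(data, data_gpu)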

Since each of these launches involves host-to-device communication, which is inefficient, I want to move the loop inside the kernel instead:

    __global__ void gpu_kernel_function()
    {
        for (...) {
            /* computation */
        }
    }

But this requires an extra step to make sure all the blocks in the grid stay in sync. My understanding of dynamic parallelism is that launching a dummy child kernel should ensure that every thread (in the whole grid) finishes that child kernel before the code continues running. So I defined another kernel alongside gpu_kernel_function and tried this:

    GPUcode = '''
    __global__ void dummy_child_kernel() { ... }

    __global__ void gpu_kernel_function()
    {
        for (...) {
            /* computation */
        }
        dummy_child_kernel<<<1, 1>>>();
    }
    '''

But I am getting this error:

    nvcc fatal : Option '--cubin (-cubin)' is not allowed when compiling for a virtual compute architecture

I am using a Tesla P100 (compute capability 6.0), Python 3.5, and CUDA 8.0.44. I am compiling my SourceModule like this:

    mod = SourceModule(GPUcode,
                       options=['-rdc=true', '-lcudart', '-lcudadevrt', '--machine=64'],
                       arch='compute_60')

I tried compute_35 too, which gives the same error.


Solution

  • The error message is telling you explicitly what the issue is: compute_60 is a virtual architecture. You can't statically compile virtual architectures to machine code; they are intended for producing PTX (virtual machine assembly) for JIT translation to machine code by the runtime. PyCUDA compiles code to a binary payload ("cubin") using the CUDA toolchain and then loads it via the driver API into the CUDA context. Hence the error.

    You can fix the error by specifying a valid physical GPU target architecture. So you should modify the source module constructor call to something like this:

    mod = SourceModule(GPUcode,
                       options=['-rdc=true', '-lcudart', '-lcudadevrt', '--machine=64'],
                       arch='sm_60')
    

    This should fix the compiler error.
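
    You can see the distinction with nvcc directly (kernel.cu here is just an illustrative file name):

    nvcc -cubin -arch=compute_60 kernel.cu   # fails with the error above
    nvcc -ptx -arch=compute_60 kernel.cu     # fine: virtual architectures produce PTX
    nvcc -cubin -arch=sm_60 kernel.cu        # fine: sm_60 is a real device target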

    However, note that using dynamic parallelism requires device code linkage, and I am 99% sure that PyCUDA still doesn't support this, so you likely won't be able to do what you are asking about via a SourceModule. You could link your own cubin by hand using the compiler outside of PyCUDA and then load that cubin inside PyCUDA. You will find many examples of how to compile dynamic parallelism correctly if you search for them.
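
    As a rough sketch of that workaround (untested, and the file names, kernel name, and sm_60 target are placeholders you would adapt), you would compile and device-link with nvcc yourself:

    nvcc -arch=sm_60 -rdc=true -dc kernel.cu -o kernel.o
    nvcc -arch=sm_60 -dlink -cubin kernel.o -lcudadevrt -o kernel.cubin

    and then load the pre-linked cubin in PyCUDA instead of using SourceModule:

    import pycuda.autoinit
    import pycuda.driver as drv

    mod = drv.module_from_file("kernel.cubin")
    gpu_kernel_function = mod.get_function("gpu_kernel_function")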