
CUDA: Forgetting kernel launch configuration does not result in NVCC compiler warning or error


When I call a CUDA kernel (a __global__ function) through a function pointer, everything appears to work just fine. However, if I forget to provide a launch configuration when calling the kernel, NVCC does not emit an error or warning; the program compiles and then crashes when I try to run it.

#include <cstdio>

__global__ void bar(float x) { printf("foo: %f\n", x); }

typedef void (*FuncPtr)(float);

void invoker(FuncPtr func)
{
    func<<<1, 1>>>(1.0);
}

int main()
{
    invoker(bar);
    cudaDeviceSynchronize();
    return 0;
}

Compile and run the above; everything works fine. Then remove the kernel's launch configuration (i.e., <<<1, 1>>>). The code still compiles, but it crashes when you try to run it.

Any idea what is going on? Is this a bug, or am I not supposed to pass around pointers to __global__ functions?

CUDA version: 8.0

OS version: Debian (Testing repo)

GPU: NVIDIA GeForce 750M


Solution

  • If we take a slightly more complex version of your repro and look at the code emitted by the CUDA toolchain front end, it becomes possible to see what is happening:

    #include <cstdio>
    
    __global__ void bar_func(float x) { printf("foo: %f\n", x); }
    typedef void(*FuncPtr)(float);
    
    void invoker(FuncPtr passed_func)
    {
    #ifdef NVCC_FAILS_HERE
        bar_func(1.0);
    #endif
        bar_func<<<1,1>>>(1.0);
        passed_func(2.0);
        passed_func<<<1,1>>>(3.0);
    }
    

    So let's compile it a couple of ways:

    $ nvcc -arch=sm_52 -c -DNVCC_FAILS_HERE invoker.cu 
    invoker.cu(10): error: a __global__ function call must be configured
    

    i.e., the front end can detect that bar_func is a __global__ function and requires launch parameters. Another attempt:

    $ nvcc -arch=sm_52 -c -keep invoker.cu
    

    As you note, this produces no compile error. Let's look at what happened:

    void bar_func(float x) ;
    # 5 "invoker.cu"
    typedef void (*FuncPtr)(float);
    # 7 "invoker.cu"
    void invoker(FuncPtr passed_func)
    # 8 "invoker.cu"
    {
    # 12 "invoker.cu"
    (cudaConfigureCall(1, 1)) ? (void)0 : (bar_func)((1.0));
    # 13 "invoker.cu"
    passed_func((2.0));
    # 14 "invoker.cu"
    (cudaConfigureCall(1, 1)) ? (void)0 : passed_func((3.0));
    # 15 "invoker.cu"
    }
    

    The standard kernel invocation syntax <<<>>> gets expanded into an inline call to cudaConfigureCall, and then a host wrapper function is called. The host wrapper has the API internals required to launch the kernel:

    void bar_func( float __cuda_0)
    # 3 "invoker.cu"
    {__device_stub__Z8bar_funcf( __cuda_0); }
    
    void __device_stub__Z8bar_funcf(float __par0)
    {
        if (cudaSetupArgument((void *)(char *)&__par0, sizeof(__par0), (size_t)0UL) != cudaSuccess) return;
        { volatile static char *__f __attribute__((unused)); __f = ((char *)((void ( *)(float))bar_func)); 
          (void)cudaLaunch(((char *)((void ( *)(float))bar_func)));
        };
    }
    

    So the stub only handles the arguments and launches the kernel via cudaLaunch. It does not handle the launch configuration itself.

    The underlying reason for the crash (actually an undetected runtime API error) is that the kernel launch happens without a prior configuration. This happens because the CUDA front end (and C++, for that matter) cannot perform pointer introspection at compile time to detect that your function pointer is actually a stub for launching a kernel.
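
    Since the failure is an undetected runtime API error rather than something the compiler can see, one way to surface it is to poll the runtime immediately after the unconfigured call. This is only a sketch, assuming the stub's failed cudaLaunch reports an error code rather than faulting outright:

```cuda
#include <cstdio>

__global__ void bar(float x) { printf("foo: %f\n", x); }

typedef void (*FuncPtr)(float);

void invoker(FuncPtr func)
{
    func(1.0f);   // forgotten <<<1, 1>>>: compiles, but the launch is unconfigured

    // Poll the runtime right after the call; an unconfigured launch
    // should surface here instead of going completely unnoticed.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        fprintf(stderr, "kernel call failed: %s\n", cudaGetErrorString(err));
}

int main()
{
    invoker(bar);
    cudaDeviceSynchronize();
    return 0;
}
```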

    I think the only way to describe this is as a "limitation" of the runtime API and compiler. I wouldn't say what you are doing is wrong, but in such a situation I would probably use the driver API and explicitly manage the kernel launch myself.
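
    As a sketch of that driver API alternative: cuLaunchKernel takes the grid and block dimensions as explicit function arguments, so there is no separate configuration step to forget. The module filename "bar.ptx" and the mangled symbol name below are assumptions for illustration (the PTX could be produced with nvcc -ptx, and the real mangled name checked with cuobjdump or c++filt):

```cuda
#include <cuda.h>
#include <cstdio>

int main()
{
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    CUfunction func;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    // Load the compiled kernel and look it up by (mangled) name.
    cuModuleLoad(&mod, "bar.ptx");                 // assumed: built with `nvcc -ptx`
    cuModuleGetFunction(&func, mod, "_Z3barf");    // mangled name of __global__ void bar(float)

    float x = 1.0f;
    void *params[] = { &x };

    // The launch configuration is part of the call itself:
    // grid (1,1,1), block (1,1,1), 0 bytes shared memory, default stream.
    cuLaunchKernel(func, 1, 1, 1, 1, 1, 1, 0, nullptr, params, nullptr);

    cuCtxSynchronize();
    cuCtxDestroy(ctx);
    return 0;
}
```

    Here a CUfunction handle can be passed around freely, and every launch must state its configuration explicitly, which removes the failure mode described above.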