When I try to call a CUDA kernel (a __global__
function) using a function pointer, everything appears to work just fine. However, if I forget to provide launch configuration when calling the kernel, NVCC will not result in an error or warning, but the program will compile and then crash if I attempt to run it.
__global__ void bar(float x) { printf("foo: %f\n", x); }
typedef void(*FuncPtr)(float);
void invoker(FuncPtr func)
{
func<<<1, 1>>>(1.0);
}
invoker(bar);
cudaDeviceSynchronize();
Compile and run the above. Everything will work just fine. Then, remove the kernel's launch configuration (i.e., <<<1, 1>>>). The code will compile just fine but it will crash when you try to run it.
Any idea what is going on? Is this a bug, or I am not supposed to pass around pointers of __global__
functions?
CUDA version: 8.0
OS version: Debian (Testing repo) GPU: NVIDIA GeForce 750M
If we take a slightly more complex version of your repro, and look at the code emitted by the CUDA toolchain front-end, it becomes possible to see what is happening:
#include <cstdio>
__global__ void bar_func(float x) { printf("foo: %f\n", x); }
typedef void(*FuncPtr)(float);
void invoker(FuncPtr passed_func)
{
#ifdef NVCC_FAILS_HERE
bar_func(1.0);
#endif
bar_func<<<1,1>>>(1.0);
passed_func(1.0);
passed_func<<<1,1>>>(2.0);
}
So let's compile it a couple of ways:
$ nvcc -arch=sm_52 -c -DNVCC_FAILS_HERE invoker.cu
invoker.cu(10): error: a __global__ function call must be configured
i.e. the front-end can detect that bar_func
is a global function and requires launch parameters. Another attempt:
$ nvcc -arch=sm_52 -c -keep invoker.cu
As you note, this produces no compile error. Let's look at what happened:
void bar_func(float x) ;
# 5 "invoker.cu"
typedef void (*FuncPtr)(float);
# 7 "invoker.cu"
void invoker(FuncPtr passed_func)
# 8 "invoker.cu"
{
# 12 "invoker.cu"
(cudaConfigureCall(1, 1)) ? (void)0 : (bar_func)((1.0));
# 13 "invoker.cu"
passed_func((2.0));
# 14 "invoker.cu"
(cudaConfigureCall(1, 1)) ? (void)0 : passed_func((3.0));
# 15 "invoker.cu"
}
The standard kernel invocation syntax <<<>>>
gets expanded into an inline call to cudaConfigureCall
, and then a host wrapper function is called. The host wrapper has the API internals required to launch the kernel:
void bar_func( float __cuda_0)
# 3 "invoker.cu"
{__device_stub__Z8bar_funcf( __cuda_0); }
void __device_stub__Z8bar_funcf(float __par0)
{
if (cudaSetupArgument((void *)(char *)&__par0, sizeof(__par0), (size_t)0UL) != cudaSuccess) return;
{ volatile static char *__f __attribute__((unused)); __f = ((char *)((void ( *)(float))bar_func));
(void)cudaLaunch(((char *)((void ( *)(float))bar_func)));
};
}
So the stub only handles arguments and launches the kernel via cudaLaunch
. It doesn't handle launch configuration
The underlying reason for the crash (actually an undetected runtime API error) is that the kernel launch happens without a prior configuration. Obviously this happens because the CUDA front end (and C++ for that matter) can't do pointer introspection at compile time and detect that your function pointer is a stub function for calling a kernel.
I think the only way to describe this is a "limitation" of the runtime API and compiler. I wouldn't say what you are doing is wrong, but I would probably be using the driver API and explicitly managing the kernel launch myself in such a situation.