Search code examples
cudafunction-pointers

Why is CUDA function cudaLaunchKernel passed a function pointer to host-code function?


I compile axpy.cu with the following command.

nvcc --cuda axpy.cu -o axpy.cu.cpp.ii

Within axpy.cu.cpp.ii, I observe that function cudaLaunchKernel nested in __device_stub__Z4axpyfPfS_ accepts an function pointer to axpy which is defined in axpy.cu.cpp.ii. So my confuse is that shouldn't cudaLaunchKernel have been passed an function pointer to kernel function axpy? Why is there function definition with the same name as kernel function? Any help would be appreciated! Thanks in advance

Both of functions are shown below.

void __device_stub__Z4axpyfPfS_(float __par0, float *__par1, float *__par2){
    void * __args_arr[3]; 
    int __args_idx = 0;
    __args_arr[__args_idx] = (void *)(char *)&__par0; 
    ++__args_idx;
    __args_arr[__args_idx] = (void *)(char *)&__par1; 
    ++__args_idx;
    __args_arr[__args_idx] = (void *)(char *)&__par2; 
    ++__args_idx;
    { 
        volatile static char *__f __attribute__((unused)); 
        __f = ((char *)((void ( *)(float, float *, float *))axpy)); 
        dim3 __gridDim, __blockDim; 
        size_t __sharedMem; 
        cudaStream_t __stream; 
        if (__cudaPopCallConfiguration(&__gridDim, &__blockDim, &__sharedMem, &__stream) != cudaSuccess) 
            return; 
        if (__args_idx == 0) { 
            (void)cudaLaunchKernel(((char *)((void ( *)(float, float *, float *))axpy)), __gridDim, __blockDim, &__args_arr[__args_idx], __sharedMem, __stream); 
        } else {
            (void)cudaLaunchKernel(((char *)((void ( *)(float, float *, float *))axpy)), __gridDim, __blockDim, &__args_arr[0], __sharedMem, __stream); 
        } 
    };
}       
void axpy( float __cuda_0,float *__cuda_1,float *__cuda_2)
# 3 "axpy.cu"
{
   __device_stub__Z4axpyfPfS_( __cuda_0,__cuda_1,__cuda_2);

}

Solution

  • So my confuse [sic] is that shouldn't cudaLaunchKernel have been passed an function pointer to kernel function axpy?

    No, because that isn't the design which NVIDIA chose and your assumption about how this function works are probably not correct. As I understand it, the first argument to cudaLaunchKernel is treated as a key, not a function pointer which is called. Elsewhere in the nvcc emitted code you will find something like this boilerplate:

    static void __nv_cudaEntityRegisterCallback( void **__T0)
    {  
        __nv_dummy_param_ref(__T0); 
        __nv_save_fatbinhandle_for_managed_rt(__T0); 
        __cudaRegisterEntry(__T0, ((void ( *)(float, float *, float *))axpy), _Z4axpyfPfS_, (-1)); 
    }
    

    You can see that the __cudaRegisterEntry call takes both a static pointer to axpy and a form the mangled symbol for the compiled GPU kernel. __cudaRegisterEntry is an internal, completely undocumented API from the CUDA runtime API. Many years ago, I satisfied myself by reverse engineering an earlier version of the CUDA runtime API, that there is an internal lookup mechanism which allows the correct instance of a host kernel stub to be matched to the correct instance of the compiled GPU code at runtime. The compiler emits a large amount of boilerplate and statically defined objects holding all the necessary definitions to make the runtime API work seamlessly without all of the additional API overhead that you need to use in the CUDA driver API or comparable compute APIs like OpenCL.

    Why is there function definition with the same name as kernel function?

    Because that is how NVIDIA decided to implement the runtime API. What you are asking about are undocumented, internal implementation details of the runtime API. You as a programmer are not supposed to have to see or use them, and they are not guaranteed to be the same from version to version.

    If, as indicated in comments, you want to implement some additional compiler infrastructure in the CUDA compilation process, use the CUDA driver API, not the runtime API.