#include <cuda_runtime.h>
#include <iostream>
#include <fstream>
using namespace std;
struct Cargs {
    int x, y;
    // execution space specifiers apply to functions, so annotate the constructor
    __host__ __device__ Cargs() {}
};
using Func = void(*)(int N, Cargs params);
void Hello(int N, Cargs params)
{
    cout << "Hello cpu\n";
}
template <Func func>
void cpu_func()
{
    int N = 100;
    Cargs params;
    params.x = 10;
    params.y = 20;
    func(N, params);
}
// using KernelFunc = __global__ void(*)(int N, Cargs params);
// template <KernelFunc func>
// void gpu_func()
// {
//     int N = 100;
//     Cargs params;
//     params.x = 10;
//     params.y = 20;
//     func<<<1,1>>>(N, params);
// }
// __global__
// void kernel1(int N, Cargs params)
// {
//     printf("Hello world\n");
// }
// __global__
// void kernel2(int N, Cargs params)
// {
//     printf("Hello gpu\n");
// }
int main()
{
    cpu_func<Hello>();
    // gpu_func<kernel1>();
    // gpu_func<kernel2>();
    return 0;
}
In my use case, a lot of functions share the same boilerplate code; they just apply different kernels to a set of inputs. So to make the code shorter, I want to use a __global__ function as a template parameter. However, I get this error:
test_kernel_template.cu:25:24: error: expected ‘;’ before ‘(’ token
25 | using KernelFunc = __global__ void(*)(int N, Cargs params);
| ^~
Compile command: nvcc -o main main.cu -std=c++20
Where is the syntax wrong?
C++ libraries (like Thrust) normally prefer device functors (classes with a __device__ member function operator()). This has the advantage that one can often use the same __host__ __device__ member function for CPU and GPU, because things like index calculation from threadIdx and co. can be done by a very generic kernel template that then just calls the functor with the right arguments. Basically, the compiler turns the kernel template calling some functor into exactly the custom kernel you need via inlining. Another advantage is that one can easily turn C++ lambdas (i.e. unnamed functors) into kernels*. A __global__ lambda is impossible, as kernels can only be free functions, not functors.
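As a minimal sketch of that pattern, reusing the Cargs type from the question (apply_kernel and Scale are illustrative names, not from any library):

template <typename F>
__global__ void apply_kernel(int N, Cargs params, F func)
{
    // generic index calculation lives here, once, for every functor
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        func(i, params);
}

struct Scale
{
    __host__ __device__
    void operator()(int i, Cargs params) const
    {
        printf("element %d: x = %d\n", i, params.x);
    }
};

// usage: apply_kernel<<<(N + 127) / 128, 128>>>(N, params, Scale{});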
If you still want to use kernel addresses, the CUDA C++ Programming Guide only mentions one restriction:

"The address of a __global__ function taken in host code cannot be used in device code (e.g. to launch the kernel). Similarly, the address of a __global__ function taken in device code cannot be used in host code."
So your idea should generally be feasible if you do not plan to use CUDA Dynamic Parallelism. I would guess that the __global__ is just not part of the function signature, i.e. try
using KernelFunc = void(*)(int N, Cargs params);
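A sketch of the resulting template under that assumption (with kernel1 and kernel2 from the question uncommented; launching a kernel through a plain function pointer is legal in host code):

template <KernelFunc func>
void gpu_func()
{
    int N = 100;
    Cargs params;
    params.x = 10;
    params.y = 20;
    func<<<1, 1>>>(N, params);  // launch through the plain pointer
}

// gpu_func<kernel1>();  // prints "Hello world"
// gpu_func<kernel2>();  // prints "Hello gpu"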
If you can't find a signature that works, you could also just use T *, like cudaLaunchKernel() (C++ API). Worst case, you can use said cudaLaunchKernel() for launching if you cannot get the T * to launch with the triple-chevron (<<< >>>) syntax.
Passing the kernel arguments to these functions can be tricky (the API/signature is not quite clear on its own), so I recommend taking a look at the 0_Introduction/simpleAWBarrier/simpleAWBarrier.cu CUDA sample, which uses cudaLaunchCooperativeKernel() with the same signature.
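For illustration, a hedged sketch of launching kernel1 from the question this way; the runtime API expects an array holding a pointer to each kernel argument, in parameter order:

int N = 100;
Cargs params;
params.x = 10;
params.y = 20;
void* args[] = { &N, &params };  // one pointer per kernel parameter, in order
cudaLaunchKernel((void*)kernel1, dim3(1), dim3(1), args, 0, 0);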
*: This has some restrictions: you need the --extended-lambda flag for nvcc, and pure __device__ lambdas have some issues, so one should use __host__ __device__ even when only planning to use the lambda on the device.
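For completeness, a sketch of the lambda route (say_hello is an illustrative name; extended lambdas must be defined inside a function, and this reuses the apply_kernel sketch from above):

// compile with: nvcc --extended-lambda -o main main.cu -std=c++20
Cargs params;
params.x = 10;
params.y = 20;
auto say_hello = [] __host__ __device__ (int i, Cargs p)
{
    printf("Hello from element %d, x = %d\n", i, p.x);
};
apply_kernel<<<1, 32>>>(32, params, say_hello);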