#include <cuda_runtime.h>
#include <iostream>
#include <fstream>
using namespace std;
struct Cargs {
    int x, y;
    // execution space specifiers apply to functions, so annotate the constructor
    __host__ __device__ Cargs() {}
};
using Func = void(*)(int N, Cargs params);
void Hello(int N, Cargs params)
{
    cout << "Hello cpu\n";
}
template <Func func>
void cpu_func()
{
    int N = 100;
    Cargs params;
    params.x = 10;
    params.y = 20;
    func(N, params);
}
// using KernelFunc = __global__ void(*)(int N, Cargs params);
// template <KernelFunc func>
// void gpu_func()
// {
//     int N = 100;
//     Cargs params;
//     params.x = 10;
//     params.y = 20;
//     func<<<1,1>>>(N, params);
// }
// __global__
// void kernel1(int N, Cargs params)
// {
//     printf("Hello world\n");
// }
// __global__
// void kernel2(int N, Cargs params)
// {
//     printf("Hello gpu\n");
// }
int main()
{
    cpu_func<Hello>();
    // gpu_func<kernel1>();
    // gpu_func<kernel2>();
    return 0;
}
In my use case, a lot of functions share the same boilerplate code; they just apply different kernels to a set of inputs. So to make the code shorter, I want to use a __global__ function as a template parameter. However, I get this error:
test_kernel_template.cu:25:24: error: expected ‘;’ before ‘(’ token
25 | using KernelFunc = __global__ void(*)(int N, Cargs params);
| ^~
Compile command: nvcc -o main main.cu -std=c++20
Where is the syntax wrong?
C++ libraries (like Thrust) normally prefer device functors (classes with a __device__ member function operator()). This has the advantage that one can often use the same __host__ __device__ member function for CPU and GPU, because things like index calculation from threadIdx and co. can be done by a very generic kernel template that then just calls the functor with the right arguments. Basically, the compiler turns the kernel template calling some functor into exactly the custom kernel you need via inlining. Another advantage is that one can easily turn C++ lambdas (i.e. unnamed functors) into kernels*. A __global__ lambda is impossible, as kernels can only be free functions, not functors.
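As a minimal sketch of that pattern, reusing the Cargs type from the question (apply_kernel and Scale are illustrative names, not from any library):

template <typename F>
__global__ void apply_kernel(int N, Cargs params, F func)
{
    // generic index calculation lives here, once, for every functor
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        func(i, params);
}

struct Scale
{
    __host__ __device__
    void operator()(int i, Cargs params) const
    {
        printf("element %d: x = %d\n", i, params.x);
    }
};

// usage: apply_kernel<<<(N + 127) / 128, 128>>>(N, params, Scale{});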
If you still want to use kernel addresses, the CUDA C++ Programming Guide only mentions one restriction:

"The address of a __global__ function taken in host code cannot be used in device code (e.g. to launch the kernel). Similarly, the address of a __global__ function taken in device code cannot be used in host code."
So your idea should generally be feasible if you do not plan to use CUDA Dynamic Parallelism. I would guess that the __global__ is just not part of the function signature, i.e. try
using KernelFunc = void(*)(int N, Cargs params);
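A sketch of the resulting template under that assumption (with kernel1 and kernel2 from the question uncommented; launching a kernel through a plain function pointer is legal in host code):

template <KernelFunc func>
void gpu_func()
{
    int N = 100;
    Cargs params;
    params.x = 10;
    params.y = 20;
    func<<<1, 1>>>(N, params);  // launch through the plain pointer
}

// gpu_func<kernel1>();  // prints "Hello world"
// gpu_func<kernel2>();  // prints "Hello gpu"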
If you can't find a signature that works, you could also just use T *, like cudaLaunchKernel() (C++ API). Worst case, you can use said cudaLaunchKernel() for launching if you cannot get the T * to launch with the triple-chevron (<<< >>>) syntax.
Passing the kernel arguments to these functions can be tricky (the API/signature is not quite clear on its own), so I recommend taking a look at the 0_Introduction/simpleAWBarrier/simpleAWBarrier.cu CUDA sample, which uses cudaLaunchCooperativeKernel() with the same signature.
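For illustration, a hedged sketch of launching kernel1 from the question this way; the runtime API expects an array holding a pointer to each kernel argument, in parameter order:

int N = 100;
Cargs params;
params.x = 10;
params.y = 20;
void* args[] = { &N, &params };  // one pointer per kernel parameter, in order
cudaLaunchKernel((void*)kernel1, dim3(1), dim3(1), args, 0, 0);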
*: This has some restrictions: you need the --extended-lambda flag for nvcc, and pure __device__ lambdas have some issues, so one should use __host__ __device__ even when only planning to use the lambda on the device.
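For completeness, a sketch of the lambda route (say_hello is an illustrative name; extended lambdas must be defined inside a function, and this reuses the apply_kernel sketch from above):

// compile with: nvcc --extended-lambda -o main main.cu -std=c++20
Cargs params;
params.x = 10;
params.y = 20;
auto say_hello = [] __host__ __device__ (int i, Cargs p)
{
    printf("Hello from element %d, x = %d\n", i, p.x);
};
apply_kernel<<<1, 32>>>(32, params, say_hello);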