What is the difference between the following two declarations?
__host__ __device__ void my_kernel(float* input, float* output, int size)
{
// code to be executed by each thread in parallel
}
__global__ void my_kernel(float* input, float* output, int size)
{
// code to be executed by each thread in parallel
}
As far as I understand, two copies of the same routine are created and executed in the first case.
How is the second case different from the first case?
The programming guide defines them
__global__ is used to decorate a kernel definition in CUDA. A kernel is a function that is executed in parallel by many device threads, each thread running the function body once, as part of the kernel launch process. A kernel must return void, and launching it requires special syntax:
my_kernel<<<...>>>(...);
It is typically launched like that from host code, but the kernel itself executes on the device (host == CPU, device == GPU).
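A minimal sketch of such a launch follows. The kernel name `scale`, the driver `run_scale`, the block size of 256, and the use of managed memory are all illustrative assumptions, not something taken from the question.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Kernel: every launched thread runs this body once, on the device.
__global__ void scale(const float* input, float* output, int size)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < size)                                   // the grid may be larger than size
        output[i] = 2.0f * input[i];
}

// Hypothetical host-side driver: returns true if every element was doubled.
bool run_scale(int size)
{
    float *input = nullptr, *output = nullptr;
    // Managed memory is reachable from host and device (assumes a GPU that supports it).
    if (cudaMallocManaged(&input, size * sizeof(float)) != cudaSuccess) return false;
    if (cudaMallocManaged(&output, size * sizeof(float)) != cudaSuccess) {
        cudaFree(input);
        return false;
    }
    for (int i = 0; i < size; ++i) input[i] = static_cast<float>(i);

    const int threads = 256;
    const int blocks = (size + threads - 1) / threads;  // round up to cover all elements
    scale<<<blocks, threads>>>(input, output, size);    // written in host code, runs on the GPU

    // Launch errors and execution errors are reported separately.
    bool ok = cudaGetLastError() == cudaSuccess &&
              cudaDeviceSynchronize() == cudaSuccess;

    for (int i = 0; ok && i < size; ++i)
        ok = (output[i] == 2.0f * input[i]);

    cudaFree(input);
    cudaFree(output);
    return ok;
}

int main()
{
    printf("scale kernel %s\n", run_scale(1000) ? "succeeded" : "failed");
    return 0;
}
```

The host never loops over the elements to do the doubling itself: the `<<<blocks, threads>>>` configuration decides how many threads execute the kernel body.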
__host__ is used to decorate an ordinary C++ style function that is callable from host code and runs on the host processor. A function with no decoration at all is treated as __host__.
__device__ is used to decorate an ordinary C++ style function that is callable from device code and runs on the device processor (i.e. callable from a thread in a kernel, or from another __device__ function).
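A function definition can have both __host__ and __device__ decorations, in which case the function is compiled twice: once for the host and once, with the same prototype, for the device. Both versions are not executed together. The host version runs when the function is called from host code, and the device version runs when it is called from device code. Either way it is an ordinary function call, executed once by the caller, not a parallel launch. Neither __host__ nor __device__ is used for decorating a kernel function, however.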
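The three decorations can be put side by side in one sketch. The names `add_one`, `square`, `transform`, and `run_transform` are hypothetical, and managed memory is again assumed to be available.

```cuda
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

// Device-only: callable from kernels and other __device__ functions.
__device__ float add_one(float x) { return x + 1.0f; }

// Compiled twice: one version for the host, one for the device.
__host__ __device__ float square(float x) { return x * x; }

// Kernel: the only one of the three that is launched with <<<...>>>.
__global__ void transform(const float* input, float* output, int size)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < size)
        output[i] = add_one(square(input[i]));  // plain calls to the device versions
}

// Hypothetical host-side driver: compares GPU results with a CPU reference.
bool run_transform(int size)
{
    float *input = nullptr, *output = nullptr;
    if (cudaMallocManaged(&input, size * sizeof(float)) != cudaSuccess) return false;
    if (cudaMallocManaged(&output, size * sizeof(float)) != cudaSuccess) {
        cudaFree(input);
        return false;
    }
    for (int i = 0; i < size; ++i) input[i] = static_cast<float>(i);

    const int threads = 256;
    const int blocks = (size + threads - 1) / threads;
    transform<<<blocks, threads>>>(input, output, size);
    bool ok = cudaGetLastError() == cudaSuccess &&
              cudaDeviceSynchronize() == cudaSuccess;

    for (int i = 0; ok && i < size; ++i) {
        // Here square() is the host version: a single ordinary call on the CPU.
        float expected = square(input[i]) + 1.0f;
        ok = std::fabs(output[i] - expected) <= 1e-5f * std::fabs(expected);
    }

    cudaFree(input);
    cudaFree(output);
    return ok;
}

int main()
{
    printf("transform %s\n", run_transform(1000) ? "matched" : "did not match");
    return 0;
}
```

Calling `add_one` from `run_transform` would be rejected by the compiler, since it exists only for the device, and `square` could not be launched with `<<<...>>>`, since it is not a kernel.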