What is the difference between __global__ and __host__ __device__?


What is the difference between the following two declarations?

__host__ __device__ void my_kernel(float* input, float* output, int size) 
{
  // code to be executed by each thread in parallel
}

__global__ void my_kernel(float* input, float* output, int size)
{
  // code to be executed by each thread in parallel
}

As far as I understand, two copies of the same routine are created and executed in the first case.

How is the second case different from the first case?


Solution

  • The CUDA programming guide defines these specifiers as follows.

    __global__ is used to decorate a kernel definition in CUDA. A kernel is a function that is executed in parallel by many threads as part of the kernel launch process. A kernel launch therefore requires special syntax:

    my_kernel<<<...>>>(...);
    

    A kernel is typically launched this way from host code, but it executes on the device (host == CPU, device == GPU).

    __host__ is used to decorate an ordinary C++ style function that is callable from host code (and runs on the host processor).

    __device__ is used to decorate an ordinary C++ style function that is callable from device code and runs on the device processor (i.e., it is callable from a kernel or from another __device__ function).

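    A function definition can have both __host__ and __device__ decorations, in which case one version of the function is compiled for the host and another version with the same prototype is compiled for the device. Neither __host__ nor __device__ is used to decorate a kernel function, however. So the first declaration in the question is an ordinary function callable from both host and device code, not a kernel, and it cannot be launched with <<<...>>>.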
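
To make the distinction concrete, here is a minimal sketch that uses both kinds of function in one program. The names scale and scale_kernel, the array size, and the launch configuration are hypothetical choices for illustration and do not come from the question. The sketch assumes a CUDA-capable GPU and compilation with nvcc.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: compiled once for the host and once for the device.
// It is an ordinary function, so it is called with normal call syntax.
__host__ __device__ float scale(float x, float factor)
{
    return x * factor;
}

// Hypothetical kernel: launched from host code with <<<...>>> and
// executed on the device by every thread of the launch.
__global__ void scale_kernel(const float* input, float* output, int size)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < size) {
        output[i] = scale(input[i], 2.0f);  // calls the device version
    }
}

int main()
{
    const int size = 8;
    float h_in[size], h_out[size];
    for (int i = 0; i < size; ++i) {
        h_in[i] = static_cast<float>(i);
    }

    // Allocate device buffers and copy the input to the GPU.
    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, size * sizeof(float));
    cudaMalloc(&d_out, size * sizeof(float));
    cudaMemcpy(d_in, h_in, size * sizeof(float), cudaMemcpyHostToDevice);

    // Kernel launch: enough blocks of 4 threads to cover all elements.
    const int threads = 4;
    const int blocks = (size + threads - 1) / threads;
    scale_kernel<<<blocks, threads>>>(d_in, d_out, size);

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        std::printf("kernel launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(h_out, d_out, size * sizeof(float), cudaMemcpyDeviceToHost);

    // The same scale() is called here on the host for comparison.
    for (int i = 0; i < size; ++i) {
        std::printf("%d: device=%f host=%f\n", i, h_out[i], scale(h_in[i], 2.0f));
    }

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Only scale_kernel is launched with the <<<...>>> syntax. scale is called like any other function, from main on the host and from inside the kernel on the device.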