What is the difference between __global__ and __host__ __device__?

What is the difference between the following two declarations?

__host__ __device__ void my_kernel(float* input, float* output, int size) 
{
  // code to be executed by each thread in parallel
}

__global__ void my_kernel(float* input, float* output, int size) 
{
  // code to be executed by each thread in parallel
}

As far as I understand, two copies of the same routine are created and executed in the first case.

How is the second case different from the first case?

Solution

The CUDA programming guide defines each of these qualifiers. In short:

__global__ is used to decorate a kernel definition in CUDA. A kernel is a function that is executed by many device threads in parallel; the number of threads is set by the launch configuration. A kernel must return void, and launching it requires special syntax:

my_kernel<<<...>>>(...);

A kernel is typically launched like this from host code, but it executes on the device (host == CPU, device == GPU).
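As a minimal sketch of what such a launch looks like in a complete program: the kernel name scale, the helper run_scale, and the sizes below are all made up for illustration, and managed memory is used only to keep the example short.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread doubles one element of the input.
__global__ void scale(const float* input, float* output, int size)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < size)                                   // guard the last, partial block
        output[i] = 2.0f * input[i];
}

// Hypothetical host function: allocates data, launches the kernel, checks the result.
bool run_scale(int size)
{
    float* input = nullptr;
    float* output = nullptr;
    if (cudaMallocManaged(&input, size * sizeof(float)) != cudaSuccess)
        return false;
    if (cudaMallocManaged(&output, size * sizeof(float)) != cudaSuccess) {
        cudaFree(input);
        return false;
    }
    for (int i = 0; i < size; ++i)
        input[i] = static_cast<float>(i);

    // Launch configuration: enough 256-thread blocks to cover every element.
    const int threads = 256;
    const int blocks = (size + threads - 1) / threads;
    scale<<<blocks, threads>>>(input, output, size);

    // The launch is asynchronous: check for launch errors, then wait for completion.
    bool ok = (cudaGetLastError() == cudaSuccess) &&
              (cudaDeviceSynchronize() == cudaSuccess);
    for (int i = 0; ok && i < size; ++i)
        ok = (output[i] == 2.0f * input[i]);

    cudaFree(input);
    cudaFree(output);
    return ok;
}

int main()
{
    printf("%s\n", run_scale(1000) ? "ok" : "failed");
    return 0;
}
```

Note that scale is written from the point of view of a single thread; the parallelism comes entirely from the <<<blocks, threads>>> configuration at the call site.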

__host__ is used to decorate an ordinary C++ style function that is callable from host code (and runs on the host processor).

__device__ is used to decorate an ordinary C++ style function that is callable from device code and runs on the device processor (i.e. callable from a thread in a kernel, or another __device__ function.)

A function definition can have both __host__ and __device__ decorations, in which case one version of the function is compiled for the host and another version with the same prototype is compiled for the device. Each version runs only when it is called from code on its own side; nothing is launched across threads automatically. Neither __host__ nor __device__ is used to decorate a kernel function, however.
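To make the contrast concrete, here is a small sketch, assuming a made-up helper square and a made-up kernel square_all. The same __host__ __device__ function is called as an ordinary function on the host and again from inside a kernel; only the __global__ function is launched.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: compiled once for the host and once for the device.
__host__ __device__ float square(float x)
{
    return x * x;
}

// Hypothetical kernel: each thread calls the device version of square().
__global__ void square_all(const float* input, float* output, int size)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < size)
        output[i] = square(input[i]);
}

int main()
{
    // Ordinary call: runs the host version, on the CPU, in one thread.
    printf("host:   square(3) = %f\n", square(3.0f));

    const int size = 256;
    float* input = nullptr;
    float* output = nullptr;
    if (cudaMallocManaged(&input, size * sizeof(float)) != cudaSuccess ||
        cudaMallocManaged(&output, size * sizeof(float)) != cudaSuccess) {
        printf("allocation failed\n");
        return 1;
    }
    for (int i = 0; i < size; ++i)
        input[i] = static_cast<float>(i);

    // Kernel launch: 256 device threads each call the device version of square().
    square_all<<<1, size>>>(input, output, size);
    if (cudaGetLastError() != cudaSuccess || cudaDeviceSynchronize() != cudaSuccess) {
        printf("kernel failed\n");
        return 1;
    }
    printf("device: square(3) = %f\n", output[3]);

    // square<<<1, 1>>>(3.0f);  // does not compile: only __global__ functions can be launched

    cudaFree(input);
    cudaFree(output);
    return 0;
}
```

So in the question, the first declaration is a plain function usable from both sides, while the second is an entry point that the host launches onto the GPU.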
