Several weeks ago, NVIDIA's Stephen Jones gave a GTC talk named "CUDA: New features and beyond", in which he presented an upcoming feature of CUDA in v11.7: A qualifier/decorator for kernel parameters named __grid_constant__
. I didn't understand the explanation about what that's supposed to mean.
Specifically, how does a __grid_constant__ int x
differ from a int x
? Aren't they both just read by threads from constant memory?
It always takes a bit of time for NVidia to update the programming guide.
This has now been done, in the CUDA C++ programming guide
7.2.4.
__grid_constant__
The
__grid_constant__
annotation for compute architectures greater or equal to 7.0 annotates a const-qualified__global__
function parameter of non-reference type that:
Has the lifetime of the grid,
Is private to the grid, i.e., the object is not accessible to host threads and threads from other grids, including sub-grids,
Has a distinct object per grid, i.e., all threads in the grid see the same address,
Is read-only, i.e., modifying a grid_constant object or any of its sub-objects is undefined behavior, including mutable members.
This means the compiler will not block you from altering the data, but doing so will result in faulty behavior.
Requirements:
Kernel parameters annotated with grid_constant must have const-qualified non-reference types.
All function declarations must match with respect to any _grid_constant parameters.
A function template specialization must match the primary template declaration with respect to any grid_constant parameters.
A function template instantiation directive must match the primary template declaration with respect to any grid_constant parameters.
As per @paleonix comment, this allows an optimization because the compiler knows that that it can follow stricter semantics than C++.
If the address of a
__global__
function parameter is taken, the compiler will ordinarily make a copy of the kernel parameter in thread local memory and use the address of the copy, to partially support C++ semantics, which allow each thread to modify its own local copy of function parameters. Annotating a__global__
function parameter with__grid_constant__
ensures that the compiler will not create a copy of the kernel parameter in thread local memory, but will instead use the generic address of the parameter itself. Avoiding the local copy may result in improved performance.__device__ void unknown_function(S const& s); __global__ void kernel(const __grid_constant__ S s) { s.x += threadIdx.x; // Undefined Behavior: tried to modify read-only memory // Compiler will _not_ create a per-thread thread local copy of "s": unknown_function(s); }
Note that C++ allows alteration of s.x
via multiple routes: if it is marked as mutable
, if you to cast away the const-ness of s
, or (gasp) using an aliased (non-const) reference. This annotation assumes you will not do such things and optimizes accordingly.
Let's unpack
Parameters for __global__
functions are passed in the read-only constant cache. Struct s
(of type S
) is marked as const
, so that seems fine. However, C++ allows members of a const struct to be changed (e.g.: if marked as mutable
or if constness is cast away).
Because the compiler does not perform a full life-time analysis, it assumes the worst and makes a per-thread copy of s
in local memory (e.g.: 1024 copies if grid size = 1024), ouch.
By marking the parameter as __grid_constant__
you promise the compiler you will not be changing any mutable
members. This allows the compiler to leave s
in the constant cache.
The constant cache is very fast if (and only if) all threads in a warp access the same datamember. (For example: accessing different elements of an array is slow). So if you do that then this will be fast. There is no way GPU code can change the constant cache, so any such attempt will fail, likely causing your program to misbehave.
If not all threads in a warp always access the same data in s
, then it may be better to allow the copy to local memory. It is fine for different warps to access different elements of s
.
I guess it is most useful when you really want to keep a structure in the constant cache, but you do not want to deal with the really awkward API for loading things into the constant cache CPU side.
Now you have a trivial API for doing so, just pass a struct by value and marked it as keep in R/O cache
: __grid_constant__
.