What does the __grid_constant__ parameter qualifier do?

Several weeks ago, NVIDIA's Stephen Jones gave a GTC talk named "CUDA: New features and beyond", in which he presented an upcoming feature of CUDA in v11.7: A qualifier/decorator for kernel parameters named __grid_constant__. I didn't understand the explanation about what that's supposed to mean.

Specifically, how does a __grid_constant__ int x differ from a int x ? Aren't they both just read by threads from constant memory?

Solution

It always takes a bit of time for NVidia to update the programming guide.

This has now been done, in the CUDA C++ programming guide

7.2.4. __grid_constant__

The __grid_constant__ annotation for compute architectures greater or equal to 7.0 annotates a const-qualified __global__ function parameter of non-reference type that:

Has the lifetime of the grid,

Is private to the grid, i.e., the object is not accessible to host threads and threads from other grids, including sub-grids,

Has a distinct object per grid, i.e., all threads in the grid see the same address,

Is read-only, i.e., modifying a grid_constant object or any of its sub-objects is undefined behavior, including mutable members.

This means the compiler will not block you from altering the data, but doing so will result in faulty behavior.

Requirements:

Kernel parameters annotated with grid_constant must have const-qualified non-reference types.

All function declarations must match with respect to any _grid_constant parameters.

A function template specialization must match the primary template declaration with respect to any grid_constant parameters.

A function template instantiation directive must match the primary template declaration with respect to any grid_constant parameters.

As per @paleonix comment, this allows an optimization because the compiler knows that that it can follow stricter semantics than C++.

If the address of a __global__ function parameter is taken, the compiler will ordinarily make a copy of the kernel parameter in thread local memory and use the address of the copy, to partially support C++ semantics, which allow each thread to modify its own local copy of function parameters. Annotating a __global__ function parameter with __grid_constant__ ensures that the compiler will not create a copy of the kernel parameter in thread local memory, but will instead use the generic address of the parameter itself. Avoiding the local copy may result in improved performance.
__device__ void unknown_function(S const& s);
__global__ void kernel(const __grid_constant__ S s) {    
    s.x += threadIdx.x; // Undefined Behavior: tried to modify read-only memory 
    // Compiler will _not_ create a per-thread thread local copy of "s": 
    unknown_function(s); }

Note that C++ allows alteration of s.x via multiple routes: if it is marked as mutable, if you to cast away the const-ness of s, or (gasp) using an aliased (non-const) reference. This annotation assumes you will not do such things and optimizes accordingly.

Let's unpack
Parameters for __global__ functions are passed in the read-only constant cache. Struct s (of type S) is marked as const, so that seems fine. However, C++ allows members of a const struct to be changed (e.g.: if marked as mutable or if constness is cast away).
Because the compiler does not perform a full life-time analysis, it assumes the worst and makes a per-thread copy of s in local memory (e.g.: 1024 copies if grid size = 1024), ouch.

By marking the parameter as __grid_constant__ you promise the compiler you will not be changing any mutable members. This allows the compiler to leave s in the constant cache.

The constant cache is very fast if (and only if) all threads in a warp access the same datamember. (For example: accessing different elements of an array is slow). So if you do that then this will be fast. There is no way GPU code can change the constant cache, so any such attempt will fail, likely causing your program to misbehave.

If not all threads in a warp always access the same data in s, then it may be better to allow the copy to local memory. It is fine for different warps to access different elements of s.

I guess it is most useful when you really want to keep a structure in the constant cache, but you do not want to deal with the really awkward API for loading things into the constant cache CPU side.
Now you have a trivial API for doing so, just pass a struct by value and marked it as keep in R/O cache : __grid_constant__.