
What's CGA in the CUDA programming model?


Hi, I understand CTA, which stands for cooperative thread array. But what is CGA, and what is the relationship between CTA and CGA? I can't find a document that explains these well.


Solution

  • CGA is a new addition to cooperative groups in the Hopper architecture.

    To disambiguate:

    • Thread: a single thread, always part of a warp.
    • Warp: 32 threads executing in lockstep*
    • Block: aka CTA, up to 1024 threads executing on a single multiprocessor
    • Grid: one or more blocks executing the same kernel; blocks are scheduled independently across the multiprocessors.
    • CTA: Cooperative thread array, see Block.
    • Cooperative groups: since CUDA 9, an API that allows threads to be grouped in more flexible arrangements than blocks and grids.
    • CGA: cooperative grid array, a new cooperative grouping introduced in the Hopper architecture (sm90). I think this is what the NVIDIA Hopper Architecture In-Depth post describes:

    New thread block cluster feature enables programmatic control of locality at a granularity larger than a single thread block on a single SM. This extends the CUDA programming model by adding another level to the programming hierarchy to now include threads, thread blocks, thread block clusters, and grids. Clusters enable multiple thread blocks running concurrently across multiple SMs to synchronize and collaboratively fetch and exchange data.

    Annoyingly, they call it a thread block cluster there, not a cooperative grid array. They also used the word cluster in a different context in the Ampere docs.
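    For concreteness, a kernel can opt into clusters either at compile time with the __cluster_dims__ attribute or at launch time with cudaLaunchKernelEx. A minimal sketch of the launch-time route (the kernel name and dimensions are made up here; this requires compiling for sm_90):

        #include <cuda_runtime.h>

        __global__ void cluster_kernel(float* data) {
            // ... body; blocks within one cluster can cooperate ...
        }

        int main() {
            float* data = nullptr;
            cudaMalloc(&data, 2048 * sizeof(float));

            cudaLaunchConfig_t config = {};
            config.gridDim = dim3(8, 1, 1);   // must be a multiple of the cluster size
            config.blockDim = dim3(256, 1, 1);

            cudaLaunchAttribute attr = {};
            attr.id = cudaLaunchAttributeClusterDimension;
            attr.val.clusterDim.x = 2;        // 2 blocks per cluster
            attr.val.clusterDim.y = 1;
            attr.val.clusterDim.z = 1;
            config.attrs = &attr;
            config.numAttrs = 1;

            cudaLaunchKernelEx(&config, cluster_kernel, data);
            cudaDeviceSynchronize();
            cudaFree(data);
        }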

    All these groupings are implemented using the cooperative groups API.
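    To make that concrete, here is a minimal sketch of how each level shows up in device code through the cooperative groups API (the cluster handle only compiles for sm90, and grid-wide sync needs a cooperative launch):

        #include <cooperative_groups.h>
        namespace cg = cooperative_groups;

        __global__ void hierarchy_kernel() {
            cg::thread_block block = cg::this_thread_block();         // the CTA
            cg::thread_block_tile<32> warp = cg::tiled_partition<32>(block);
            cg::grid_group grid = cg::this_grid();                    // the whole grid

            block.sync();   // equivalent to __syncthreads()
            warp.sync();    // warp-level barrier
            // grid.sync() is only legal when the kernel was started with
            // cudaLaunchCooperativeKernel.
        #if __CUDA_ARCH__ >= 900
            cg::cluster_group cluster = cg::this_cluster();           // the CGA
            cluster.sync(); // barrier across all blocks in the cluster
        #endif
        }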

    See: Cooperative groups in CUDA
    The documentation is here: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cooperative-groups

    And a blog post is here: https://developer.nvidia.com/blog/cooperative-groups/

    CGA specifically is mentioned in the technical blog on new CUDA 12 features (for Hopper), namely:

    Support for C intrinsics for cooperative grid array (CGA) relaxed barriers

    These are documented here in the CUDA programming guide as:

    barrier_arrive and barrier_wait member functions were added for grid_group and thread_block. Description of the API is available here.
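    A minimal sketch of what these split (arrive/wait) barriers look like on a thread_block; everything except barrier_arrive and barrier_wait is made up here, and this needs CUDA 12:

        #include <cooperative_groups.h>
        #include <cuda/std/utility>   // cuda::std::move, usable in device code
        namespace cg = cooperative_groups;

        __global__ void producer_consumer(float* buf) {
            __shared__ float tile[256];   // assumes a launch with 256 threads per block
            cg::thread_block block = cg::this_thread_block();

            tile[threadIdx.x] = buf[blockIdx.x * blockDim.x + threadIdx.x];

            // Arrive: announce that our write is done, without blocking yet.
            cg::thread_block::arrival_token token = block.barrier_arrive();

            // ... independent work that does not touch `tile` can go here ...

            // Wait: block until every thread in the block has arrived.
            block.barrier_wait(cuda::std::move(token));

            // Only now is it safe to read what other threads wrote.
            float neighbor = tile[(threadIdx.x + 1) % blockDim.x];
            buf[blockIdx.x * blockDim.x + threadIdx.x] = neighbor;
        }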

    These barriers are a big deal, because barriers are how threads synchronize, and synchronization is vital if threads are to cooperate harmoniously. We could always synchronize through global memory, but that traffic must go through the L2 cache (200 cycles) or even main memory (500 cycles). On Hopper it can instead travel over a dedicated SM-to-SM network connecting the shared memories (shared memory occupies the same on-chip storage as the L1 cache), hence much faster communication.

    This innovation is enabled as follows. On Ampere and before, each SM has its own private shared memory area, visible only to the blocks running on that SM. On Hopper, every SM in a group can access the shared memory of every other SM in the group; NVIDIA calls such a group of blocks a cluster. This allows for very efficient inter-block communication, and the new barriers are implemented using this fast access to shared memory between blocks in the cluster.
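    A minimal sketch of that cross-SM access, using cluster_group::map_shared_rank from the cooperative groups API (the kernel name and sizes are made up; this needs sm_90 and a cluster launch, here fixed at compile time):

        #include <cooperative_groups.h>
        namespace cg = cooperative_groups;

        // Two blocks per cluster, fixed at compile time.
        __global__ void __cluster_dims__(2, 1, 1) exchange(int* out) {
            __shared__ int smem;
            cg::cluster_group cluster = cg::this_cluster();

            if (threadIdx.x == 0)
                smem = blockIdx.x;

            // Our shared memory must be written before the peer block reads it.
            cluster.sync();

            // map_shared_rank returns a pointer into the *other* block's shared
            // memory, reached over the SM-to-SM network rather than global memory.
            unsigned int peer = cluster.block_rank() ^ 1;
            int* peer_smem = cluster.map_shared_rank(&smem, peer);
            if (threadIdx.x == 0)
                out[blockIdx.x] = *peer_smem;

            // Keep the peer's shared memory alive until all reads are done.
            cluster.sync();
        }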

    This mechanism for accessing another block's shared memory is not well documented in the CUDA programming guide, but it is detailed in the PTX ISA documentation.
    In the CUDA programming guide it is buried in the memcpy_async section.
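    For reference, the plain (single block, global-to-shared) form of memcpy_async that section is built around looks roughly like this; the sketch is my own and does not itself touch another block's shared memory:

        #include <cooperative_groups.h>
        #include <cooperative_groups/memcpy_async.h>
        namespace cg = cooperative_groups;

        __global__ void staged(const int* in, int* out) {
            extern __shared__ int tile[];   // dynamic shared memory
            cg::thread_block block = cg::this_thread_block();

            // The block cooperatively issues one asynchronous copy of a tile
            // from global to shared memory; on Ampere and later this maps to
            // the hardware async-copy path, bypassing registers.
            cg::memcpy_async(block, tile, in + blockIdx.x * blockDim.x,
                             sizeof(int) * blockDim.x);
            cg::wait(block);                // wait until the copy has landed

            out[blockIdx.x * blockDim.x + threadIdx.x] = tile[threadIdx.x] * 2;
        }

    Launch this with blockDim.x * sizeof(int) bytes of dynamic shared memory as the third launch parameter.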

    *) On Volta and later, threads in a warp can diverge, but doing so is computationally expensive.