The "cooperative groups" mechanism has appeared in recent versions of CUDA. Some of it involves actual hardware features which are less obvious (?) to utilize otherwise; but a lot of it is basically just library code; and it's difficult to discern where the hardware actually assists with some special functionality.
This question is about GPUs with Compute Capability 7.x. In the CUDA programming guide, I notice the following feature dependencies on compute capabilities:
| Cooperative group feature | Required minimum Compute Capability |
|---|---|
| labeled_partition() | 7.0 |
| binary_partition() | 7.0 |
| async_memcpy() actually being asynchronous | 8.0 |
| Some kind of asynchronicity of wait() | 8.0 |
| "acceleration" of reduce() | 8.0 |
| Use of intrinsics in reductions for: plus, less, greater, bitwise and, bitwise or, bitwise xor | 8.0 |
What hardware features, specifically, were introduced with CC 7.0 and with CC 8.0 which enable this functionality? What are their exact semantics? And are they all explicitly exposed via PTX, or are some of them only visible in SASS?
You can find some, or all, of these new hardware features as newly-introduced PTX instructions in the latest PTX ISA reference.
There are now single PTX-level instructions which perform a reduction over the values held by (some of) the active threads in a warp:
```
redux.sync.op.type dst, src, membermask;
  .op   = { .add, .min, .max }
  .type = { .u32, .s32 }

redux.sync.op.b32  dst, src, membermask;
  .op   = { .and, .or, .xor }
```
All participating threads get the result in the dst register; overflow beyond 32 bits is truncated. These are somewhat similar to the vote.sync operations, I suppose.
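For example, here is a minimal sketch (my own illustration, not from the PTX ISA document) of wrapping this instruction in a CUDA device function via inline PTX; the function name warp_sum and the full-warp mask are assumptions, and it needs to be compiled for sm_80 or newer:

```cuda
// A full-warp unsigned integer sum using redux.sync.add.u32.
// All 32 lanes must be active and pass the same membermask;
// every participating lane receives the reduction result.
__device__ unsigned warp_sum(unsigned value) {
    unsigned result;
    asm volatile("redux.sync.add.u32 %0, %1, %2;"
                 : "=r"(result)
                 : "r"(value), "r"(0xffffffffu));
    return result;
}
```

In practice you would more likely reach this instruction through cooperative_groups::reduce() with cg::plus<unsigned>() on a 32-thread tile, which is presumably what the "acceleration" of reduce() on CC 8.0 refers to.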
The asynchronous-copy mechanism, in turn, is actually several related PTX operations:
```
cp.async.{ca, cg}.shared.global
cp.async.commit_group
cp.async.{wait_group, wait_all}
```
This allows "registering" data for a copy, packing up registered data together into a group, then waiting until the data in that group is ready. But - it only seems to apply to copying from global memory to shared memory (and not even in the other direction).
Read more about this mechanism here. Before CC 8.0, the way you perform such copies is with ld.whatever instructions with global memory as the source, then st.whatever instructions into shared memory, and finally a barrier.sync.aligned instruction (or a non-aligned barrier with a thread count, if you only want a subset of the block's threads to synchronize).
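For comparison, a sketch of that older pattern (again my own illustration, with an assumed tile size): a plain global load, a store into shared memory, and a block-wide barrier, which the compiler lowers to ld.global, st.shared and barrier.sync.aligned respectively:

```cuda
constexpr int TILE = 256;  // illustrative; launch with blockDim.x == TILE

__global__ void stage_manually(const float* global_in, float* global_out) {
    __shared__ float tile[TILE];
    int i = threadIdx.x;

    tile[i] = global_in[blockIdx.x * TILE + i];  // ld.global then st.shared
    __syncthreads();                             // barrier.sync.aligned

    global_out[blockIdx.x * TILE + i] = tile[i] * 2.0f;
}
```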
(to be added: more instructions)