The "cooperative groups" mechanism has appeared in recent versions of CUDA. Some of it involves actual hardware features which are less obvious (?) to utilize otherwise; but a lot of it is basically just library code; and it's difficult to discern where the hardware actually assists with some special functionality.
This question is about GPUs with Compute Capability 7.x. In the CUDA programming guide, I notice the following feature dependencies on compute capabilities:
| Cooperative group feature | Required minimum Compute Capability |
|---|---|
| labeled_partition() | 7.0 |
| binary_partition() | 7.0 |
| async_memcpy() actually being asynchronous | 8.0 |
| Some kind of asynchronicity of wait() | 8.0 |
| "acceleration" of reduce() | 8.0 |
| Use of intrinsics in reductions for: plus, less, greater, bitwise and, bitwise or, bitwise xor | 8.0 |
What hardware features, specifically, were introduced with CC 7.0 and with CC 8.0 which enable this functionality? What are their exact semantics? And are they all explicitly exposed via PTX, or are some of them only visible in SASS?
You can find some, or all, of these new hardware features as newly-introduced PTX instructions in the latest PTX ISA reference.
There are now single PTX-level instructions which perform a reduction over the values held by (some of) the active threads in a warp:
```
redux.sync.op.type dst, src, membermask;
  .op   = { .add, .min, .max }
  .type = { .u32, .s32 }

redux.sync.op.b32  dst, src, membermask;
  .op   = { .and, .or, .xor }
```
All participating threads get the result in the dst register; overflow beyond 32 bits is truncated. These are somewhat similar to the vote.sync operations, I suppose.
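For example, here is a minimal sketch (my own illustration, not from the PTX ISA document) of wrapping this instruction in a CUDA device function via inline PTX; the function name warp_sum and the full-warp mask are assumptions, and it needs to be compiled for sm_80 or newer:

```cuda
// A full-warp unsigned integer sum using redux.sync.add.u32.
// All 32 lanes must be active and pass the same membermask;
// every participating lane receives the reduction result.
__device__ unsigned warp_sum(unsigned value) {
    unsigned result;
    asm volatile("redux.sync.add.u32 %0, %1, %2;"
                 : "=r"(result)
                 : "r"(value), "r"(0xffffffffu));
    return result;
}
```

In practice you would more likely reach this instruction through cooperative_groups::reduce() with cg::plus<unsigned>() on a 32-thread tile, which is presumably what the "acceleration" of reduce() on CC 8.0 refers to.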
The asynchronous-copy mechanism, in turn, is actually several related PTX operations:
```
cp.async.{ca, cg}.shared.global
cp.async.commit_group
cp.async.{wait_group, wait_all}
```
This allows "registering" data for a copy, packing up registered data together into a group, then waiting until the data in that group is ready. But - it only seems to apply to copying from global memory to shared memory (and not even in the other direction).
Read more about this mechanism here. Before CC 8.0, the way you perform such copies is with ld.whatever instructions with global memory as the source, then st.whatever instructions into shared memory, and finally a barrier.sync.aligned instruction (or a non-aligned barrier with a thread count, if you only want a subset of the block's threads to synchronize).
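For comparison, a sketch of that older pattern (again my own illustration, with an assumed tile size): a plain global load, a store into shared memory, and a block-wide barrier, which the compiler lowers to ld.global, st.shared and barrier.sync.aligned respectively:

```cuda
constexpr int TILE = 256;  // illustrative; launch with blockDim.x == TILE

__global__ void stage_manually(const float* global_in, float* global_out) {
    __shared__ float tile[TILE];
    int i = threadIdx.x;

    tile[i] = global_in[blockIdx.x * TILE + i];  // ld.global then st.shared
    __syncthreads();                             // barrier.sync.aligned

    global_out[blockIdx.x * TILE + i] = tile[i] * 2.0f;
}
```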
(to be added: more instructions)