This question is inspired by the discussion in this link: When is shfl.sync.idx fast?, where, in addition to the original question, it was debated whether __shfl_sync() and other warp-level shuffle instructions operate directly on the register file.
From my understanding, variables in kernel functions are allocated to registers or memory (global/shared) based on compiler optimizations, and developers cannot explicitly force variables to reside in registers.
This raises two related questions:
1. How can we guarantee that values exchanged by __shfl_sync() or other warp-level shuffle instructions are stored in registers and not in shared or global memory? If they are, what ensures this?
2. Why is __shfl_sync() consistently faster than exchanging data through shared or global memory? Does this imply that register-level operations are not strictly required for its performance advantage?

Here’s an example for context:
__global__ void shuffle_example(float *output) {
float val = threadIdx.x; // Thread-specific value
float shuffled = __shfl_sync(0xFFFFFFFF, val, 0); // Get value from lane 0
output[threadIdx.x] = shuffled;
}
How can we ensure that val is in a register during the shuffle?
__shfl_* and its improved version __shfl_*_sync always operate on registers. They have to, because their very premise is to be a faster alternative to shuffling data through shared memory.
They have been part of CUDA since version 5 (version 9, I think, for the _sync versions) and have always operated the same way.
The shuffle instructions share data between threads in the same warp when the source lane is known (a gather operation). This is often used to broadcast a single value to all threads, or to perform reductions -- although on newer hardware reductions can be done more efficiently with atomic operations on shared memory or, from Ampere onwards, with the __reduce_*_sync instructions.
On every incarnation of NVidia hardware that supports shfl (Kepler or newer), it is implemented in SASS as a SHFL instruction, with an added SYNC as needed.
This is easily tested by selecting the SASS display on Compiler Explorer with a custom --arch=smXX command line, see: https://cuda.godbolt.org/z/WYzcn83d1
If you vary the --arch setting, you'll see the exact same SASS code for __shfl, even though the surrounding instruction mix may change.
__global__ void shfl(int* data) {
const auto a = threadIdx.x;
data[threadIdx.x] = __shfl_down_sync(-1u, a, 1);
}
compilation:
| architecture | SASS |
| --- | --- |
| sm52 | `SHFL.DOWN PT, R0, R4, 0x1, 0x1f` |
| sm60 | `SHFL.DOWN PT, R0, R4, 0x1, 0x1f` |
| sm70 | `SHFL.DOWN PT, R5, R0, 0x1, 0x1f` |
| sm80 | `SHFL.DOWN PT, R5, R0, 0x1, 0x1f` |
| sm90 | `SHFL.DOWN PT, R5, R7, 0x1, 0x1f` |
Unfortunately, Compiler Explorer only supports CUDA versions down to 9.0, so I cannot show you the SASS output before Pascal, but I know from memory that all the way back to CUDA 5/Kepler the output is in fact the same.
When is shfl faster than shared or global memory?
The problem with shfl is latency. If you do, say, a reduction, you can code it like so:
const auto a = threadIdx.x;
auto sum = a;
sum += __shfl_down_sync(-1u, sum, 1);
sum += __shfl_down_sync(-1u, sum, 2);
sum += __shfl_down_sync(-1u, sum, 4);
sum += __shfl_down_sync(-1u, sum, 8);
sum += __shfl_down_sync(-1u, sum, 16);
These instructions form a dependency chain where every instruction needs to wait for the result of the previous instruction.
If you code it like so:
const auto a = threadIdx.x;
__shared__ int result;
if (threadIdx.x == 0) result = 0; // initialize once, not from every thread
__syncthreads();                  // make the zero visible before accumulating
atomicAdd(&result, a);
it will run faster on newer hardware; I forget by how much, but faster.
This is because atomic operations on shared memory have been heavily optimized on newer hardware, and shfl has not.
If you have Ampere or newer, __reduce_add_sync will do the reduction faster still.
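As a minimal sketch (the kernel name and indexing are my own; __reduce_add_sync requires compute capability 8.0+ and takes integer operands), the five-shfl chain above collapses to a single intrinsic:

```cuda
// Warp-wide integer sum on sm_80+ using the Ampere reduction intrinsic.
// Every lane in the mask receives the full sum; no dependency chain of shfl's.
__global__ void warp_reduce(const int* in, int* out) {
    int v = in[threadIdx.x];
    int sum = __reduce_add_sync(0xFFFFFFFFu, v);
    if ((threadIdx.x & 31) == 0)        // one lane per warp writes the result
        out[threadIdx.x / 32] = sum;
}
```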
Global memory and the L2 cache are both so slow that they are never faster than shfl.
On modern hardware the guideline is that shfl will be faster if you can save a write + read from shared memory, but a single read from shared mem will be faster than shfl.
A single atomicAdd will beat a stack of 5 shfl's to do a reduction.
So shfl has its uses, but it is no longer the catch-all tool it was back in the Pascal days.
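The "save a write + read" case can be sketched like this (kernel name and indexing are hypothetical): one lane loads a value all lanes need, and a single shfl replaces a shared-memory write, a barrier, and a read.

```cuda
// Broadcast one value per warp without touching shared memory.
// Only lane 0 loads; the shuffle distributes the value to all 32 lanes.
__global__ void warp_broadcast(const float* in, float* out) {
    float v = 0.0f;
    if ((threadIdx.x & 31) == 0)
        v = in[threadIdx.x / 32];
    v = __shfl_sync(0xFFFFFFFFu, v, 0);  // replaces: smem write + sync + read
    out[threadIdx.x] = v;
}
```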
Spilling of registers to the stack
In CUDA the stack is backed by local memory, which is really global (sigh) memory backed by the L1/L2 cache.
Unlike on the CPU, on NVidia GPUs the stack is slow.
The way the stack works is that the GPU issues LDL/STL (load from / store to local memory) instructions to move data between the stack and registers, and then operates on the registers.
The GPU is best viewed as a RISC processor, so it does not have x86-like instructions that can operate directly on main memory. Every non-load, non-atomic instruction operates on registers only (with some exceptions for constant memory).
Stop looking at PTX
PTX is a byte-code, like that of the Java JVM. It has a bunch of simplifying abstractions that the underlying GPU hardware does not have, and it also misses a pass of the excellent NVidia optimizer. Looking at it is somewhat like looking at -O0 Intel x86 assembly: that's not the code your GPU actually executes.