Tags: cuda, gpu, gpu-shared-memory, gpu-warp

What is warp shuffling in CUDA and why is it useful?


From the CUDA Programming Guide:

[Warp shuffle functions] exchange a variable between threads within a warp.

I understand that this is an alternative to shared memory, so it can be used by threads within a warp to "exchange" or share values. But what's the intuition behind it (how does it work)? What's its benefit over using shared memory?


Solution

  • A warp shuffle is about inter-thread communication. Prior to the existence of warp shuffle, the most direct and efficient mechanism to exchange data between threads in a threadblock was to use shared memory, such as you might do in a typical shared-memory sweep-style reduction.

    A detailed design for warp shuffle (how it works internally) isn't provided by NVIDIA, but at a behavioral level it allows the direct exchange of register data (e.g. thread-local variables) from one thread to another in a warp, using fairly flexible source/destination descriptions. For a shuffle op to be sensible, both the source and destination threads must be participating. Additional behavioral description is given in the programming guide, and there are numerous questions here on the SO cuda tag discussing it.

    An example could be:

    int r = __shfl_sync(0xffffffff, value,     0); 
        ^                 ^         ^          ^
    destination           bit mask  source     source lane
    variable              for       variable
                          threads 
                          which 
                          must 
                          participate
    

    After the above shuffle op, the variable r for every thread in the warp would contain the quantity held by value from thread 0 in the warp. Thread 0 has "broadcast" its value to other threads.
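    To make that broadcast concrete, here is a minimal, self-contained sketch (the kernel name, launch configuration, and host code are my own illustration, not from the programming guide):

    #include <cstdio>

    // Sketch: each warp broadcasts the value held by its lane 0 to every
    // other thread in the warp via __shfl_sync. The full 0xffffffff mask
    // assumes all 32 lanes of the warp are active.
    __global__ void broadcast_kernel(const int *in, int *out)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        int value = in[idx];                       // thread-local (register) data
        int r = __shfl_sync(0xffffffff, value, 0); // every lane receives lane 0's value
        out[idx] = r;
    }

    int main()
    {
        const int n = 64;                          // two warps, for illustration
        int *in, *out;
        cudaMallocManaged(&in, n * sizeof(int));
        cudaMallocManaged(&out, n * sizeof(int));
        for (int i = 0; i < n; i++) in[i] = i;

        broadcast_kernel<<<1, n>>>(in, out);
        cudaDeviceSynchronize();

        // out[0..31] should all equal in[0]; out[32..63] should all equal in[32]
        printf("out[5] = %d, out[40] = %d\n", out[5], out[40]);

        cudaFree(in);
        cudaFree(out);
        return 0;
    }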

    The primary benefits of warp shuffle are:

    1. fewer instructions/steps/operations (and lower latency): Whereas communicating a data item from a register in thread B to a register in thread A via shared memory requires at least 2 steps (a shared store instruction followed by a shared load instruction, and probably also a synchronization step), the same communication via warp shuffle requires a single operation/instruction.

    2. less shared memory "pressure": There are 2 aspects to this. First, the overall amount of shared memory used may be reduced. Warp shuffle by itself requires no shared memory, and even a full-blown 1024-thread reduction requires only 32 elements of shared memory storage (illustrated in the sketch after this list). Since shared memory is a precious resource (very limited in size) and shared memory usage may also be a limiter to occupancy, this confers benefits in both areas: more shared memory may be available for other parts of the algorithm, and if shared memory usage is a concern for occupancy, then shuffle lets you mostly avoid it (for communication patterns that can be handled via shuffle ops). Second, shared memory pressure can manifest as throughput to shared memory. Depending on how a shared memory reduction is implemented (for example), shared memory throughput may be a limiter to performance. Therefore, if we can shift some or all shared memory transactions to an alternate path/unit, that may reduce the shared memory transaction pressure, which will improve code performance if that is the limiter.
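    The following sketch illustrates both points: a warp-level sum reduction built from __shfl_down_sync needs no shared memory at all, and a 1024-thread block reduction built on top of it needs only 32 shared elements (one per warp). The function and variable names are my own; this is one common pattern, not the only way to write it.

    // Warp-level sum reduction via shuffle. Each __shfl_down_sync is a single
    // instruction; the shared-memory equivalent of each step would need a
    // store, a synchronization, and a load.
    __inline__ __device__ int warpReduceSum(int val)
    {
        // Assumes all 32 lanes of the warp are active (full 0xffffffff mask).
        for (int offset = 16; offset > 0; offset >>= 1)
            val += __shfl_down_sync(0xffffffff, val, offset);
        return val;                            // lane 0 holds the warp's sum
    }

    // Block-level reduction for up to 1024 threads: only 32 shared elements
    // are needed, one per warp, to combine the per-warp partial sums.
    __inline__ __device__ int blockReduceSum(int val)
    {
        __shared__ int warpSums[32];
        int lane   = threadIdx.x % 32;
        int warpId = threadIdx.x / 32;

        val = warpReduceSum(val);              // reduce within each warp, no shared memory

        if (lane == 0) warpSums[warpId] = val; // one shared store per warp
        __syncthreads();

        // Let the first warp reduce the per-warp partial sums.
        int numWarps = (blockDim.x + 31) / 32;
        val = (threadIdx.x < numWarps) ? warpSums[lane] : 0;
        if (warpId == 0) val = warpReduceSum(val);

        return val;                            // thread 0 of the block holds the block sum
    }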

    As hinted at above, another possible benefit of warp shuffle is a reduced need for threadblock-wide synchronization instructions (such as __syncthreads()), which is generally a good thing in CUDA programming.

    Although I've mostly focused on reduction for motivating examples, shuffle ops can be used to build concise operations of other types, such as prefix sums.
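    For instance, a warp-level inclusive prefix sum (scan) can be written with __shfl_up_sync. Again, this is a sketch under the assumption that all 32 lanes of the warp are active, and the function name is my own:

    // Warp-level inclusive prefix sum: lane i ends up with the sum of the
    // values originally held by lanes 0..i (Hillis-Steele style scan).
    __inline__ __device__ int warpInclusiveScan(int val)
    {
        int lane = threadIdx.x % 32;
        for (int offset = 1; offset < 32; offset <<= 1) {
            int n = __shfl_up_sync(0xffffffff, val, offset);
            if (lane >= offset)                // lanes below 'offset' have no source lane
                val += n;
        }
        return val;
    }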