Tags: cuda, gpu, nvidia

Example use case for threads hierarchy in CUDA


I'm learning CUDA and read that the point of the thread hierarchy is to allow for scalability and compatibility across GPUs with differing capabilities. (I don't know if this is the sole benefit).

In NVIDIA's doc (Section 1.3), there is a small picture highlighting how an 8-block "multi-threaded" CUDA program can be run on both a GPU with 2 SMs and a GPU with 4 SMs. However, I fail to understand why this could not be achieved by just having threads, as opposed to having the grid and block abstractions.

Please could someone provide an example use case for the threads hierarchy, where having just threads would be insufficient?


Solution

  • The thread hierarchy, just like the memory hierarchy, is all about locality of reference.

    Warp
    Threads are grouped in clumps of 32: a warp. Threads in a warp can share data using the __shfl_sync family (__shfl_up_sync, __shfl_down_sync, __shfl_xor_sync), the warp reduce and warp match functions, and the (very useful) warp vote instructions.

    Threads in a warp have cheap synchronisation: __syncwarp
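
    A minimal sketch of what this enables (assuming a full warp of 32 active threads): a warp-wide sum computed entirely in registers with __shfl_down_sync, with no memory traffic at all.

    __device__ int WarpSum(int value) { //each thread contributes its own value
        constexpr auto All = -1u; //all 32 threads in the warp take part
        for (int offset = 16; offset > 0; offset /= 2) {
            value += __shfl_down_sync(All, value, offset); //add the value held 'offset' lanes higher
        }
        return value; //lane 0 now holds the sum of all 32 values
    }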

    Thread block
    The next step up is a thread block: up to 32 warps (1024 threads) that collaborate on the same multiprocessor (SM). There are no dedicated instructions for intra-block communication, but threads in a block can use __shared__ memory to exchange data. This is almost as efficient as __shfl_sync.
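
    A minimal sketch of that collaboration (assuming blockDim.x == 1024 and the hypothetical WarpSum sketch above): each warp reduces in registers, then one partial result per warp is exchanged through __shared__ memory.

    __device__ int BlockSum(int value) { //assumes blockDim.x == 1024, i.e. 32 warps
        __shared__ int warpsums[32]; //one slot per warp, visible to the whole block
        const auto partial = WarpSum(value); //warp-level reduction; lane 0 holds its warp's sum
        if (threadIdx.x % 32 == 0) { warpsums[threadIdx.x / 32] = partial; }
        __syncthreads(); //block-wide barrier before reading what other warps wrote
        //the first warp adds up the 32 per-warp results; thread 0 ends up with the block-wide sum
        return (threadIdx.x < 32) ? WarpSum(warpsums[threadIdx.x]) : 0;
    }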

    Grid block
    Next up we have multiprocessors (SMs); a GPU can have 20 or 100 of those. These (prior to Hopper) can only communicate via global memory: a thread in SM a stores data in global memory, and a thread in SM b reads that data. Two blocks of a grid typically run on different SMs (and even if they happen to land on the same SM, they cannot share __shared__ memory). (Since Hopper, thread block clusters let a thread in SM a move data from its shared memory to the shared memory of another SM.)
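
    A hedged sketch of that pattern (a hypothetical grid-wide sum, reusing the BlockSum sketch above and assuming the grid exactly covers the data): the only place where blocks can meet is global memory, here via atomicAdd.

    __global__ void GridSum(const int* data, int* total) { //*total must be zeroed before the launch
        const auto i = blockIdx.x * blockDim.x + threadIdx.x;
        const auto blocksum = BlockSum(data[i]); //fast reduction inside each block
        if (threadIdx.x == 0) {
            atomicAdd(total, blocksum); //blocks exchange results through global memory only
        }
    }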

    Different GPUs
    The next level of the hierarchy is running your program on multiple GPUs on the same machine. These can only communicate across the GPU interconnect (NVLink, or plain PCIe).
    This is obviously even slower and more complicated.
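
    A minimal host-side sketch (the buffer pointers and size are placeholders, error checking omitted): data has to be moved between devices explicitly, for example with a peer-to-peer copy.

    //host code: copy a buffer that lives on GPU 0 into a buffer that lives on GPU 1
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0); //allow direct access to device 1, where supported
    cudaMemcpyPeer(dst_on_gpu1, 1, src_on_gpu0, 0, numBytes); //explicit copy across the interconnect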

    Different machines
    The final level is running on different machines, which communicate via the network (either local or over the internet).

    When to use what
    Obviously a single warp can only perform 32 concurrent tasks.
    A block only 1024 concurrent tasks, and so on.
    If time is limited and you need all 20 x 32 x 32 = 20480 threads of a GPU with 20 SMs, then you need a full grid of blocks, and you'll have to design your program accordingly.
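
    For example, a hypothetical launch of the GridSum sketch above that fills such a GPU (data and total are device pointers):

    GridSum<<<20, 1024>>>(data, total); //20 blocks x 1024 threads = 20480 threads spread over the SMs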

    Each level closer to the warp executes faster than the level above it, because the threads involved are physically closer together on the chip, allowing the manufacturer to design fast interconnects. The hierarchy also eases the programming model, so that different generations of GPUs can be programmed in the same way.

    But why?
    Designing it like this allows the GPU to provide fast options at the warp and block level. If we only had the grid, every thread would be forced to communicate via global memory alone.

    The following example, which gets the maximum value (out of 32) from a bitonic sequence shaped like this:

      /\
     /  \  e.g.:  1, 2, 10, 20, 30, 500, 400, 3, 0.
    /
    
    
    __device__ int GetMaxValue(const int value) {  //every thread has different value
        //the input must be a bitonic sequence
        constexpr auto All = -1u; //all threads in the warp take part
        const auto Neighbor = __shfl_down_sync(All, value, 1); //thread 31 gets its own value back
        const auto mask = __ballot_sync(All, Neighbor > value); //e.g. 0b0000000011111111
        const auto MaxBit = __popc(mask); //The thread holding the max value
        const auto result = __shfl_sync(All, value, MaxBit); //share max value with all
        return result; //all threads return the same result
    }
    

    is much faster than the equivalent code using only global memory, if we pretend that warps and blocks do not exist:

    __device__ int GetMaxValue_NoWarp(int value, int* tempstorage) {
        //tempstorage points to 33 ints in global memory: 32 value slots plus one mask slot
        tempstorage[32] = 0;
        tempstorage[threadIdx.x] = value;
        __syncthreads();
        //thread 31 has no right-hand neighbour, so it reuses its own value
        const auto Neighbor = (threadIdx.x == 31) ? value : tempstorage[threadIdx.x + 1];
        const auto diff = int(Neighbor > value);
        atomicOr(&tempstorage[32], diff << threadIdx.x);
        __syncthreads();
        const auto mask = tempstorage[32];
        const auto MaxBit = __popc(mask); //index of the thread holding the max value
        const auto result = tempstorage[MaxBit];
        return result; //all threads return the same result
    }
    

    Even a single read from tempstorage in global memory takes longer than all of GetMaxValue put together, and GetMaxValue_NoWarp needs many such reads and writes, because it has no fast mechanism to share data between threads.
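
    For completeness, a hedged usage sketch (the kernel and its buffers are hypothetical; launch it with a single warp, e.g. <<<1, 32>>>):

    __global__ void MaxOfBitonic(const int* sequence, int* out) {
        const auto value = sequence[threadIdx.x]; //one element per thread, in bitonic order
        const auto maxValue = GetMaxValue(value); //every thread receives the same maximum
        if (threadIdx.x == 0) { *out = maxValue; }
    }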