
CUDA: Does the SM thread count matter when sizing a kernel's blocks?


There are countless articles and SO questions explaining what a kernel's grid and block sizes should be set to and how to optimise these values, but the articles never seem to mention the SM limits. My understanding is that an SM can execute a maximum of 1536 threads and a maximum of 8 blocks. Surely these values have some relevance in such calculations, so why don't they crop up more?

E.g. if my kernel's block size is 128 threads then, with the 8-block limit, each SM will only run 1024 (8 × 128) out of a possible 1536 threads, which is quite the under-utilisation.
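For instance, here's that arithmetic as a toy snippet (assuming the 8-blocks/1536-threads-per-SM limits I mentioned; the helper is purely illustrative):

```c++
#include <cstdio>

// Illustrative helper: how many threads one SM can keep resident for a
// given block size, under assumed per-SM limits of max_blocks blocks
// and max_threads threads.
static int resident_threads(int block_size, int max_blocks, int max_threads)
{
    int blocks_by_threads = max_threads / block_size; // limited by the thread budget
    int blocks = blocks_by_threads < max_blocks ? blocks_by_threads : max_blocks;
    return blocks * block_size;
}

int main()
{
    // 128-thread blocks, capped at 8 blocks: 8 * 128 = 1024 of 1536 threads
    std::printf("%d of 1536\n", resident_threads(128, 8, 1536));
    return 0;
}
```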

Or maybe it's just me, and it's taken this long for that particular light-bulb moment, while everyone else "just knows" to factor in these things!


Solution

  • tl;dr: These numbers don't appear more often because:

    1. You're looking at the wrong numbers.
    2. The numbers you're looking at are very microarchitecture-dependent.

    Now for the longer version:

    My understanding is that an SM can execute a maximum of 1536 threads and a maximum of 8 blocks.

    Assuming you mean NVIDIA GPUs with compute capabilities 8.6-8.9, and looking at the CUDA compute capabilities table, we find:

    | Feature                                   | CC 8.6 / 8.7 | CC 8.9 |
    |-------------------------------------------|--------------|--------|
    | Maximum number of resident blocks per SM  | 16           | 24     |
    | Maximum number of resident threads per SM | 1536         | 1536   |

    and I assume this is where you got the number 1536. (Still: not 8 blocks per SM, but 16 or 24; the 8-block limit dates back to compute capability 2.x and older.)

    Yet if we look at compute capability 8.0, e.g. on the A100 cards, it is 2048 resident threads, not 1536; and 32 resident blocks. Ditto for compute capability 9.0. So, like @RobertCrovella wrote in this comment, this is very architecture-dependent.
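    As a minimal sketch (device 0, error checking omitted for brevity), you can query these per-SM limits at runtime rather than hard-coding them:

    ```c++
    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);  // device 0

        // The per-SM residency limits discussed above; both vary by
        // compute capability, which is why hard-coding 1536 (or 8) is risky.
        // Note: maxBlocksPerMultiProcessor requires CUDA 11 or later.
        std::printf("Compute capability:      %d.%d\n", prop.major, prop.minor);
        std::printf("Max resident threads/SM: %d\n", prop.maxThreadsPerMultiProcessor);
        std::printf("Max resident blocks/SM:  %d\n", prop.maxBlocksPerMultiProcessor);
        return 0;
    }
    ```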

    But even if you fix the GPU microarchitecture, these are still the wrong numbers!

    The maximum size of a block is not the same as the maximum number of threads which may be resident on an SM. It is quite typical for multiple blocks to be resident on an SM, with warps from different blocks getting scheduled as execution resources become available. CUDA has defined the maximum block size to be 1024 threads ever since compute capability 2.0 (earlier devices were limited to 512), regardless of the microarchitecture.
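    To see that 1024 is a hard launch limit rather than a residency limit, here's a sketch where an oversized block fails at launch time (trivial no-op kernel, purely for illustration):

    ```c++
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void noop() {}  // trivial kernel, just for the launch test

    int main()
    {
        noop<<<1, 1025>>>();  // one thread over the 1024-thread limit
        std::printf("1025-thread block: %s\n", cudaGetErrorString(cudaGetLastError()));
        // prints "invalid configuration argument" - the launch never happens

        noop<<<1, 1024>>>();  // exactly at the limit: launches fine
        std::printf("1024-thread block: %s\n", cudaGetErrorString(cudaGetLastError()));
        return 0;
    }
    ```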

    As for the maximum grid size - that's more of a "numeric limits" kind of thing. In the same table, you will find:

    | Maximum x-dimension of a grid of thread blocks       | 2^31 - 1 |
    | Maximum y- or z-dimension of a grid of thread blocks | 65535    |

    These maxima don't depend on your kernel's block size, nor on the maximum block size in threads. And while they depend on your microarchitecture in principle (e.g. NVIDIA could make a card which supports 128K blocks in the Y axis of the grid), in practice these values have not changed for many years.
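    Again, these limits are queryable rather than something to memorise (a sketch for device 0, error checking omitted):

    ```c++
    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        // maxGridSize holds the x, y, z limits from the table above;
        // expect {2^31 - 1, 65535, 65535} on anything recent.
        std::printf("Max grid dims: %d x %d x %d\n",
                    prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
        return 0;
    }
    ```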


    Note, though, that even if the max number of resident threads is not the relevant number, choosing a block size which doesn't _divide_ the max-resident-threads count means you will necessarily have some "slack" of potential resident threads you're not using: 1024-thread blocks with 1536 max resident threads means an SM will have either 0 or 1 resident blocks, never utilizing the potential for 512 more threads (= 16 more warps), as @RobertCrovella mentions. But then again, whoever said you need those extra resident warps? Maybe your 1024 threads (= 32 warps) are enough to keep the SM busy? It's possible, depending on how your kernel code utilizes SM resources and how your warps interact.
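    If you do want to know how much residency "slack" a given block size leaves for your specific kernel, the occupancy API will tell you. A sketch (my_kernel below is a stand-in; the real numbers depend on your kernel's register and shared-memory usage):

    ```c++
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void my_kernel(float* data)  // stand-in kernel for illustration
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        data[i] *= 2.0f;
    }

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        for (int block_size : {128, 256, 512, 1024}) {
            int max_blocks = 0;
            // How many blocks of this size can be resident on one SM, given
            // this kernel's actual register/shared-memory usage?
            cudaOccupancyMaxActiveBlocksPerMultiprocessor(
                &max_blocks, my_kernel, block_size, /* dynamic smem: */ 0);
            std::printf("block %4d: %2d resident blocks, %4d of %d threads\n",
                        block_size, max_blocks, max_blocks * block_size,
                        prop.maxThreadsPerMultiProcessor);
        }
        return 0;
    }
    ```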