There are countless articles and SO questions explaining what a kernel's grid and block sizes should be set to and how to optimise these values, but the articles never seem to mention the SM limits. My understanding is that an SM can execute a maximum of 1536 threads and a maximum of 8 blocks. Surely these values have some relevance in such calculations, so why don't they crop up more?
E.g. if my kernel's block size is 128 threads then each SM will only run 1024 out of a possible 1536 threads, which is quite the under-utilisation.
Or maybe it's just me, and it's taken this long for that particular light-bulb moment, while everyone else "just knows" to factor in these things!
tl;dr: These numbers don't appear more often because:
Now for the longer version:
> My understanding is that an SM can execute a maximum of 1536 threads and a maximum of 8 blocks.
Assuming you mean NVIDIA GPUs with compute capabilities 8.6-8.9, and looking at the CUDA compute capabilities table, we find:
Feature | CC 8.6/8.7 | CC 8.9 |
---|---|---|
Maximum number of resident blocks per SM | 16 | 24 |
Maximum number of resident threads per SM | 1536 | 1536 |
I assume this is where you got the number 1536. (Still, that's not 8 blocks per SM, but 16 or 24.)
Yet if we look at compute capability 8.0, e.g. on the A100 cards, it is 2048 resident threads, not 1536, and 32 resident blocks. Ditto for compute capability 9.0. So, like @RobertCrovella wrote in this comment, this is very architecture-dependent.
But even if you fix the GPU microarchitecture, these are still the wrong numbers!
The maximum size of a block is not equal to the maximum number of threads which may be resident on an SM. It is quite typical for multiple blocks to be resident on an SM, with warps from different blocks getting scheduled as they become available. Since compute capability 2.0, CUDA has defined the maximum block size to be 1024 threads, regardless of the microarchitecture (it was 512 on the 1.x devices).
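To make the "multiple resident blocks" point concrete, here is a minimal host-side C++ sketch of the arithmetic. It is not a CUDA API: `SmLimits`, `resident_blocks` and `resident_threads` are names I made up for illustration, and it assumes that only the two per-SM limits from the table matter (in practice, registers and shared memory per block can lower the result further).

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper type: the two per-SM residency limits discussed above.
struct SmLimits {
    int max_resident_blocks;   // e.g. 16 for CC 8.6/8.7
    int max_resident_threads;  // e.g. 1536 for CC 8.6/8.7
};

constexpr int kWarpSize = 32;

// Upper bound on how many blocks of `block_size` threads fit on one SM.
// Assumption: registers and shared memory are NOT the limiting factor.
inline int resident_blocks(SmLimits sm, int block_size) {
    assert(block_size > 0 && block_size <= 1024);
    // Threads are resident in whole warps, so round the block up to warps.
    const int warps_per_block = (block_size + kWarpSize - 1) / kWarpSize;
    const int max_warps = sm.max_resident_threads / kWarpSize;
    return std::min(sm.max_resident_blocks, max_warps / warps_per_block);
}

// Number of threads actually resident on the SM for that block size.
inline int resident_threads(SmLimits sm, int block_size) {
    return resident_blocks(sm, block_size) * block_size;
}
```

With the question's assumed limit of 8 blocks, `resident_threads(SmLimits{8, 1536}, 128)` gives 1024. With the 16-block limit from the table, `resident_threads(SmLimits{16, 1536}, 128)` gives 1536, i.e. twelve 128-thread blocks fill the SM. For a real kernel, the runtime's `cudaOccupancyMaxActiveBlocksPerMultiprocessor` also accounts for register and shared memory usage.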
As for the maximum grid size, that's more of a "numeric limits" kind of thing. In the same table, you will find:
Feature | Value |
---|---|
Maximum x-dimension of a grid of thread blocks | 2^31 - 1 |
Maximum y- or z-dimension of a grid of thread blocks | 65535 |
These maxima don't depend on the block size you choose, or on the maximum block size in threads. And while they depend on your microarchitecture in principle (e.g. NVIDIA could make a card which supports 128K blocks in the Y axis of the grid), in practice these values have not changed for many years.
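As a rough illustration of why the grid limit behaves like a numeric limit, here is a small host-side C++ sketch that works out how many blocks a one-dimensional launch needs and whether that fits in the x-dimension. The names `kMaxGridDimX`, `blocks_needed` and `fits_in_grid_x` are hypothetical, and the constant is simply the 2^31 - 1 figure from the table, hard-coded rather than queried from a device.

```cpp
#include <cassert>
#include <cstdint>

// x-dimension limit of a grid of thread blocks, as quoted above: 2^31 - 1.
constexpr std::uint64_t kMaxGridDimX = (std::uint64_t{1} << 31) - 1;

// Smallest number of blocks such that blocks * block_size >= n.
// Written as divide-then-adjust so it cannot overflow for large n.
inline std::uint64_t blocks_needed(std::uint64_t n, std::uint64_t block_size) {
    assert(block_size > 0);
    return n / block_size + (n % block_size != 0 ? 1 : 0);
}

// True if a 1D grid with one thread per element fits in the x-dimension.
inline bool fits_in_grid_x(std::uint64_t n, std::uint64_t block_size) {
    return blocks_needed(n, block_size) <= kMaxGridDimX;
}
```

For example, `blocks_needed(1000, 128)` is 8. With 1024-thread blocks, the x-dimension alone covers `kMaxGridDimX * 1024` elements (over two trillion), so this limit rarely decides your block size.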