Tags: architecture, cuda, gpu

What is the difference between the maximum number of threads per block and the number of CUDA cores in one SM?


Although I have searched for this question and read many related answers, I only find myself more confused.

First, let me present my understanding:

  • Hardware: An SM contains multiple warps, with each warp typically having 32 CUDA cores. These cores function like simple computational units, performing one calculation per clock cycle. (According to the timestamp in this video.)

  • Software: There is a single grid (I believe?), consisting of multiple blocks, with each block containing multiple threads and each block having a certain limit called "max threads per block."

Next, I will number my understandings. If I am wrong, please explain it to me; if I am correct, please confirm. Thank you so much for taking the time.

  1. I understand that each block executes on only one SM and does not move between different SMs. For example, for the RTX 3090, according to TechPowerUp, the RTX 3090 has 10,496 shading units (CUDA cores) and 82 SMs. This means each SM has 10,496 / 82 = 128 CUDA cores, which translates to 128 / 32 = 4 warps. (These limits can also be read from the runtime API; see the device-query sketch after this list.)

  2. I assume that each CUDA core executes one thread at a time. The 32 CUDA cores in each warp will perform one instruction for 32 threads.

  3. However, I often see the "max threads per block" number (usually 2048) exceed the number of CUDA cores in an SM. I think 2048 / 32 = 64 warps would be scheduled on one SM, not executed simultaneously.

  4. I believe that the "max threads per block" = 2048 is a memory limit, meaning the maximum number of threads that can be recorded and scheduled at the same time, not the maximum number of threads that an SM can execute simultaneously (in one clock cycle).
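
To sanity-check these numbers, a small device-property query can be used; this is a minimal sketch (it assumes device 0 and only prints a few fields of cudaDeviceProp):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0 assumed

    printf("SM count:              %d\n", prop.multiProcessorCount);
    printf("Warp size:             %d\n", prop.warpSize);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max threads per SM:    %d\n", prop.maxThreadsPerMultiProcessor);
    return 0;
}
```

The last two lines report exactly the per-block and per-SM limits that my points 3 and 4 are about.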

However, in some answers or articles, I see statements like "each core executes 32 threads."

  5. I believe that statement is incorrect.

Another example in this article, the author states that:

For both of the following examples, we assume we have the same computational resources. For this small example of a 32×32 matrix multiply, we use 8 SMs (about 10% of an RTX 3090) and 8 warps per SM.

....

If we want to do an AB=C matrix multiply, where each matrix is of size 32×32, then we want to load memory that we repeatedly access into shared memory because its latency is about five times lower (200 cycles vs 34 cycles). A memory block in shared memory is often referred to as a memory tile or just a tile. Loading two 32×32 floats into a shared memory tile can happen in parallel by using 2*32 warps. We have 8 SMs with 8 warps each, so due to parallelization, we only need to do a single sequential load from global to shared memory, which takes 200 cycles.

  6. I believe this statement is incorrect. As I mentioned above, the RTX 3090 has only 4 warps per SM. Two 32x32 matrices require 2x32 warps, which means 16 SMs, not 8 SMs as the author stated.

Solution

  • An SM contains multiple warps, with each warp typically having 32 CUDA cores.

    Wrong. "CUDA core" is a marketing term that refers to the number of scalar 32 bit floating point execution units. Meaning, it is the number of simple (add, multiply) float operations that an SM / GPU can execute per clock cycle. This is unrelated to the number threads per warp or number of warps per SM.

    If you look, for example, at figure 5 of the Volta architecture whitepaper, you see that an SM is partitioned into four processing blocks. A warp always resides inside a single processing block, and those blocks have only 16 FP32 cores, a.k.a. CUDA cores. This means it takes them two cycles to execute an FP32 instruction for a whole 32-thread warp. How many cycles it takes to execute an instruction is an implementation detail that changes between architectures and that you normally do not need to care about, except when calculating the total processing power of a GPU.
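
    As a rough sketch of that total-processing-power calculation, the following estimates peak FP32 throughput from the device properties. The 128 FP32 cores per SM is an assumption that holds for GA102 (consumer Ampere, e.g. the RTX 3090); other architectures have different counts:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    const int fp32CoresPerSM = 128;         // assumption: GA102-class SM
    double clockHz = prop.clockRate * 1e3;  // cudaDeviceProp::clockRate is in kHz
    // One FMA per core per cycle counts as 2 floating point operations
    double tflops = prop.multiProcessorCount * fp32CoresPerSM * 2.0 * clockHz / 1e12;
    printf("Estimated peak FP32 throughput: ~%.1f TFLOPS\n", tflops);
    return 0;
}
```

    For an RTX 3090 (82 SMs at roughly 1.7 GHz boost) this lands near the advertised ~35.6 TFLOPS, which is also where the 10,496 CUDA core figure comes from: 82 * 128 = 10,496.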

  • There is a single grid (I believe?)

    Wrong, you can execute multiple grids simultaneously in multiple streams.
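
    A minimal sketch of two grids submitted to different streams (the kernel and sizes are invented for illustration; whether the grids actually overlap depends on free SM resources):

```cpp
#include <cuda_runtime.h>

__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Two independent grids; the GPU may run their blocks concurrently.
    scale<<<(n + 255) / 256, 256, 0, s1>>>(a, 2.0f, n);
    scale<<<(n + 255) / 256, 256, 0, s2>>>(b, 0.5f, n);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```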

  • I understand that each block executes on only one SM and does not move between different SMs

    Mostly correct. Modern GPUs support preemption and when that happens, the block may be evicted and later restored to a different SM. But you can say that they only execute on a single SM at a time.

    Note with regard to the processing blocks discussed above, a thread block is distributed over all processing blocks on an SM. They all live on the same SM to make use of the shared memory and synchronization primitives.
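
    A kernel sketch of why that matters: shared memory and __syncthreads() are per-block, so all warps of a block must live on the same SM (this assumes a launch with 256 threads per block):

```cpp
__global__ void reverseInBlock(float *data) {
    __shared__ float tile[256];        // allocated on the SM the block runs on
    int i = threadIdx.x;
    int base = blockIdx.x * blockDim.x;

    tile[i] = data[base + i];
    __syncthreads();                   // all warps of this block wait here
    data[base + i] = tile[blockDim.x - 1 - i];
}
```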

  • I assume that each CUDA core executes one thread at a time.

    Correct. Or rather, it starts a new instruction per cycle. See the discussion on pipelining below.

  • The 32 CUDA cores in each warp will perform one instruction for 32 threads.

    Again, a CUDA core does not belong to a warp. They are effectively shared by all threads assigned to the processing block (or SM in older architectures with a single processing block per SM).

    I believe that the "max threads per block" = 2048 is a memory limit, meaning the maximum number of threads that can be recorded and scheduled at the same time, not the maximum number of threads that an SM can execute simultaneously (in one clock cycle).

    Half-right, half-wrong. 2048 is the limit of threads per SM. The limit per thread block is 1024. Therefore you always need two or more blocks to fully occupy an SM. And in GPU generations with 1536 threads per SM, thread blocks sized 1024 can never fully occupy the GPU since only one fits.

    You are right that the limit of threads is independent of the limit of simultaneously executed operations. I wouldn't call it a memory limit, though, since it is more a limit of the scheduler hardware; but that's secondary. Referring back to the Volta architecture above, note how its dispatcher and scheduler have a throughput of 32 threads per clock cycle with only 16 FP32 cores, and even fewer load/store units etc. The only way this works is if memory or integer operations are interleaved with floating point operations.
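
    You can also query these scheduler limits directly through the occupancy API; a small sketch, using a stand-in empty kernel and no dynamic shared memory:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel() {}  // stand-in kernel for illustration

int main() {
    int blocksPerSM = 0;
    // How many resident blocks of a given size fit on one SM at the same time?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel, 1024, 0);
    printf("Resident 1024-thread blocks per SM: %d\n", blocksPerSM);

    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel, 256, 0);
    printf("Resident 256-thread blocks per SM:  %d\n", blocksPerSM);
    return 0;
}
```

    On a part limited to 1536 threads per SM, you would expect 1 and 6 respectively, which is the occupancy point made above.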

  • However, in some answers or articles, I see statements like "each core executes 32 threads."
    5. I believe that statement is incorrect.

    Citation needed but yeah, that statement is incorrect. One may say that a processing block executes 32 threads per clock cycle. Again, with the caveats discussed above.

  • the RTX 3090 has only 4 warps per SM. Two 32x32 matrices require 2x32 warps, which means 16 SMs, not 8 SMs as the author stated.

    Well, again incorrect. I don't think this needs repeating.

    Instead, let's talk about the reason for this setup with more threads than cores: it simply exists to hide the latency of operations. Many operations are pipelined, which means a new instruction can start each cycle, but each one takes multiple cycles to complete (typically 4 on Volta, again subject to change).

    Others have high latency waiting for the memory subsystem. In both cases, the scheduler needs to find independent instructions to start each cycle. That's what the "over-subscription" does. Different warps are by definition independent and therefore offer the best way of finding those. It's the same principle used by CPUs in SMT, a.k.a. hyperthreading.

    In the matrix prefetching you cited, it is rather secondary that the SM takes multiple cycles to start the memory load operations. Yes, it takes multiple cycles to start the operation because there are fewer load/store units than threads. But most of the time is spent waiting for a response from the GPU main memory anyway. The point is that all those memory loads are independent. Therefore they can be scheduled without waiting for the previous load to complete. And you also remove redundant loads / replace them with cheaper shared memory loads.
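
    For reference, the tile the article describes looks roughly like this kernel sketch (one 32x32 FP32 tile per matrix, launched with a single 32x32 thread block; indexing is simplified to one tile):

```cpp
__global__ void tiledMultiply(const float *A, const float *B, float *C) {
    __shared__ float tileA[32][32];
    __shared__ float tileB[32][32];

    int row = threadIdx.y, col = threadIdx.x;

    // One global load per thread per matrix; loads from different warps are
    // independent, so the scheduler can issue them without waiting for
    // earlier ones to return from DRAM.
    tileA[row][col] = A[row * 32 + col];
    tileB[row][col] = B[row * 32 + col];
    __syncthreads();

    // All further accesses hit the much faster shared memory.
    float acc = 0.0f;
    for (int k = 0; k < 32; ++k)
        acc += tileA[row][k] * tileB[k][col];
    C[row * 32 + col] = acc;
}
```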

    Note that you don't necessarily need 2048 threads to do this as the cited article implies. GPUs can issue independent instructions from the same thread in consecutive cycles without waiting for the previous result. This is called scoreboarding (however, a GPU's ability to do this is far more limited than a modern CPU's, which is why CPUs get away with running only one or two threads per core). You can also combine four float loads into a single float4 memory load if you can guarantee proper memory alignment. And modern CUDA supports asynchronous memory copies by itself.
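
    And a sketch of the float4 trick mentioned above (assumes the pointers are 16-byte aligned and n is a multiple of 4):

```cpp
__global__ void copyVec4(const float *src, float *dst, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n / 4) {
        // One 16-byte load/store per thread instead of four 4-byte ones,
        // so fewer instructions occupy the load/store units.
        reinterpret_cast<float4 *>(dst)[i] =
            reinterpret_cast<const float4 *>(src)[i];
    }
}
```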