
CUDA Tensor Cores: What is the effect of NumBlocks and ThreadsPerBlock?


I am wondering what the effect of NumBlocks and ThreadsPerBlock is on this simple matrix multiplication routine:

#include <mma.h>
using namespace nvcuda;

// M, N, K assumed to be the standard 16x16x16 half-precision WMMA tile shape
constexpr int M = 16, N = 16, K = 16;

__global__ void wmma_matrix_mult(half *a, half *b, half *out) {

   // Declare the fragments
   wmma::fragment<wmma::matrix_a, M, N, K, half, wmma::row_major> a_frag;
   wmma::fragment<wmma::matrix_b, M, N, K, half, wmma::row_major> b_frag;
   wmma::fragment<wmma::accumulator, M, N, K, half> c_frag;

   // Initialize the output to zero
   wmma::fill_fragment(c_frag, 0.0f);

   // Load the inputs
   wmma::load_matrix_sync(a_frag, a, N);
   wmma::load_matrix_sync(b_frag, b, N);

   // Perform the matrix multiplication
   wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);

   // Store the output
   wmma::store_matrix_sync(out, c_frag, N, wmma::mem_row_major);
}

Calling

`wmma_matrix_mult<<<1, 1>>>`: Incorrect
`wmma_matrix_mult<<<1, 2>>>`: Incorrect
`wmma_matrix_mult<<<1, 4>>>`: Incorrect
`wmma_matrix_mult<<<1, 8>>>`: Incorrect
`wmma_matrix_mult<<<1, 16>>>`: Incorrect
`wmma_matrix_mult<<<1, 32>>>`: Correct

Why does the number of threads per block even matter if every thread is executing the same code? As you can see, I am not using threadIdx.x anywhere inside the kernel.


Solution

  • Tensor core operations happen at the warp level. The w in wmma signifies that. Referring to the documentation:

    This requires co-operation from all threads in a warp.

    Each tensor core unit can accept one matrix multiply operation (i.e. wmma::mma_sync) from a warp per clock cycle.

    This means that a full warp (32 threads) must be available and participating for the operation to make any sense (i.e. to be legal). All of the wmma:: operations are collective ops: an entire warp is expected to be executing them, and this is necessary for correct usage.

    If you have multiple warps participating (e.g. a threadblock size of 64, or 128, etc.), you are effectively asking for multiple operations to be done, just like any other CUDA code.

    Like any other CUDA code, launching an operation with multiple blocks is just a way to scale up the work being done, and of course is necessary if you want to engage the resources of a GPU that has multiple SMs. Since tensor core units are a per-SM resource, this would be necessary to witness a CUDA GPU delivering anything approaching its full rated throughput for tensor core ops (see the multi-warp, multi-block sketch at the end of this answer).

    Why does the number of threads per block even matter if every thread is executing the same code?

    Every thread is not doing the same thing. The wmma:: collective ops hide code under the hood that specializes thread behavior according to which warp lane it belongs to. For example, the thread in warp lane 0 will select different elements of the fragment to associate with (i.e. load, store) than a thread in any other warp lane. A small probe illustrating this per-lane ownership follows below.
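
    As a concrete illustration of that specialization, here is a minimal probe, my own sketch rather than anything from the documentation: it loads one 16x16 tile into a matrix_a fragment and has two lanes report how many elements of the tile each of them privately holds. The kernel name is hypothetical; the lane-to-element mapping itself is deliberately left unspecified by CUDA.

#include <mma.h>
#include <cuda_fp16.h>
#include <cstdio>
using namespace nvcuda;

__global__ void fragment_ownership_probe(const half *a) {
   wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;

   // Collective op: all 32 lanes of the warp must execute this together
   wmma::load_matrix_sync(a_frag, a, 16);

   // Each lane privately holds num_elements values of the 16x16 tile in the
   // opaque x[] array; which tile elements land in which lane is unspecified
   // and varies by architecture.
   int lane = threadIdx.x % 32;
   if (lane < 2) {
      printf("lane %d holds %d elements, x[0] = %f\n",
             lane, a_frag.num_elements, __half2float(a_frag.x[0]));
   }
}

// Launched with exactly one warp, e.g.: fragment_ownership_probe<<<1, 32>>>(d_a);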
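
    Finally, the multi-warp and multi-block points combine naturally into a tiled kernel. The following is a minimal sketch, not code from the question or the documentation: each warp in the grid computes one independent 16x16 tile of a larger product. The kernel name, tiling scheme, and launch shape are illustrative assumptions (matrix dimensions assumed to be multiples of 16, block size a multiple of 32).

#include <mma.h>
using namespace nvcuda;

constexpr int WMMA_M = 16, WMMA_N = 16, WMMA_K = 16;

// One warp per 16x16 output tile; rows, cols, inner are the full GEMM
// dimensions (out is rows x cols, a is rows x inner, b is inner x cols),
// all assumed to be multiples of 16.
__global__ void wmma_tiled_mult(const half *a, const half *b, half *out,
                                int rows, int cols, int inner) {
   // Linear warp index across the whole grid picks the output tile
   // (requires blockDim.x to be a multiple of warpSize)
   int warpId  = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
   int tilesX  = cols / WMMA_N;
   int tileRow = (warpId / tilesX) * WMMA_M;
   int tileCol = (warpId % tilesX) * WMMA_N;
   if (tileRow >= rows) return;   // whole surplus warps exit together

   wmma::fragment<wmma::matrix_a, WMMA_M, WMMA_N, WMMA_K, half, wmma::row_major> a_frag;
   wmma::fragment<wmma::matrix_b, WMMA_M, WMMA_N, WMMA_K, half, wmma::row_major> b_frag;
   wmma::fragment<wmma::accumulator, WMMA_M, WMMA_N, WMMA_K, half> c_frag;
   wmma::fill_fragment(c_frag, 0.0f);

   // March along the shared (inner) dimension, accumulating into c_frag
   for (int k = 0; k < inner; k += WMMA_K) {
      wmma::load_matrix_sync(a_frag, a + tileRow * inner + k, inner);
      wmma::load_matrix_sync(b_frag, b + k * cols + tileCol, cols);
      wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
   }
   wmma::store_matrix_sync(out + tileRow * cols + tileCol, c_frag, cols,
                           wmma::mem_row_major);
}

// e.g. 128 threads = 4 warps per block, one warp per output tile:
// int warps = (rows / 16) * (cols / 16);
// wmma_tiled_mult<<<(warps + 3) / 4, 128>>>(d_a, d_b, d_out, rows, cols, inner);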