
Connection between number of registers in thread block and in Streaming Multiprocessor (SM)


I was testing a simple program that queries GPU-specific data this way:

#include <stdio.h>
#include <cuda_runtime.h>

int main() {
    int deviceCount;
    cudaGetDeviceCount(&deviceCount);

    if (deviceCount == 0) {
        printf("No CUDA capable devices found!\n");
        return 1;
    }

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    printf("Device Name: %s\n", prop.name);
    printf("Regs Per Block: %d\n", prop.regsPerBlock);
    printf("Regs Per Multiprocessor: %d\n", prop.regsPerMultiprocessor);

    return 0;
}

What I found was that the number of registers per (thread) block is the same as the number of registers per multiprocessor (SM): 65536.

GPU used for the tests: NVIDIA GTX 1650 (Turing architecture).

  1. Since I assumed each SM hosts more than one thread block, the number of registers per SM should be considerably higher, unless a single thread block can use all the registers assigned to a particular SM. I couldn't find any credible sources discussing the overall idea. Could someone explain this part, please?
  2. What would be the general way of finding out the number of registers per SM, assuming I don't want to use regsPerMultiprocessor? What is the connection between these two parameters, and is it possible to compute the latter using only CUDA / nvidia-smi / nvidia-settings (i.e., no user-hardcoded values)?

Solution

  • The reported data is correct.

    Since I assumed each SM hosts more than one thread block,

    Not necessarily. A Turing SM has a maximum limit of 1024 resident threads. It's true that the number of blocks that can be resident is higher than one, but the SM has a variety of hardware limits that must all be satisfied in order for a threadblock to be deposited on an SM.

    Given that a Turing SM has 65536 registers (it really does, and there is no other number), this allows for 64 registers for each of 1024 threads. You could have two threadblocks of 512 threads each (64 registers per thread), but you could not have two threadblocks of 1024 threads each (regardless of register usage).
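    To make that arithmetic concrete, here is a minimal sketch (not from the original post) that derives the register-limited thread count from the queried device properties, assuming an illustrative per-thread register usage of 64:

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        // Illustrative value; the real number comes from compiling
        // your kernel (e.g. nvcc -Xptxas -v).
        int regsPerThread = 64;

        // On Turing: 65536 / 64 = 1024, which exactly matches the
        // SM's 1024-resident-thread hardware limit.
        int threadsLimitedByRegs = prop.regsPerMultiprocessor / regsPerThread;
        printf("Register-limited threads per SM: %d\n", threadsLimitedByRegs);
        return 0;
    }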

    And if your GPU thread code used more than 64 registers per thread (quite possible), then you would not even be able to launch one 1024-thread threadblock per SM (you would get a runtime error at the kernel launch). You would have to reduce the total number of threads in a single threadblock.
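    If you want to know the per-thread register count of your compiled kernel at runtime, cudaFuncGetAttributes can report it; a minimal sketch, using a hypothetical placeholder kernel:

    #include <stdio.h>
    #include <cuda_runtime.h>

    // Hypothetical placeholder kernel; substitute your own.
    __global__ void myKernel(float *out) {
        out[threadIdx.x] = (float)threadIdx.x;
    }

    int main() {
        cudaFuncAttributes attr;
        cudaFuncGetAttributes(&attr, myKernel);

        // numRegs: registers the compiler assigned per thread.
        // If numRegs * (threads per block) exceeds regsPerBlock,
        // the kernel launch fails with a runtime error.
        printf("Registers per thread: %d\n", attr.numRegs);
        return 0;
    }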

    the number of registers per SM should be considerably higher, unless a single thread block can use all the registers assigned to a particular SM.

    It is not more, and yes, a single threadblock (on Turing) can consume all of the SM's registers.

    I couldn't find any credible sources discussing the overall idea.

    The CUDA programming guide has the relevant specs for each GPU type (compute capability); see its table of technical specifications per compute capability.

    What would be the general way of finding out the number of registers per SM, assuming I don't want to use regsPerMultiprocessor?

    You're imagining there is some other number or relevant spec. There isn't. The number of registers per SM is regsPerMultiprocessor, and it is 65536 for a Turing SM. That number can limit the total number of threads you can deposit on that SM.

    If you're asking for a non-programmatic method (I don't think you are), the information is available in the programming guide.

    There is no way to retrieve it using nvidia-smi or nvidia-settings. The way you do it in CUDA is exactly the way you have shown, via cudaGetDeviceProperties.

    Also, as long as you have met the requirements for a threadblock to be schedulable on an SM, these calculations don't limit the total number of threadblocks you can launch. Once the SMs are "full", additional threadblocks in your grid wait in a queue for spots to open up on the SMs.
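    If you want the runtime to do this per-SM accounting for you (registers, shared memory, and thread limits all at once), the occupancy API can report how many blocks of a given kernel and block size can be resident on one SM. A sketch, reusing the hypothetical kernel from above:

    #include <stdio.h>
    #include <cuda_runtime.h>

    // Hypothetical placeholder kernel; substitute your own.
    __global__ void myKernel(float *out) {
        out[threadIdx.x] = (float)threadIdx.x;
    }

    int main() {
        int blockSize = 512;
        int numBlocks = 0;
        // Accounts for register, shared memory, and thread limits at once.
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, myKernel,
                                                      blockSize, 0);
        printf("Resident %d-thread blocks per SM: %d\n", blockSize, numBlocks);
        return 0;
    }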