I have already read several threads about the capacity about the GPU and understood that the concept of blocks and threads has to be seperated from the physical Hardware. Although the maximum amount of threads per block is 1024, there is no limit on the number of blocks one can use. However, as the number of streaming processors is finite, there has to be a physical limit. After I wrote a GPU program, I would be interested in evaluating the used capacity of my GPU. To do this, I have to know how many threads I could start theoretically at one time on hardware. My graphics card is a Nvidia Geforce 1080Ti, so I have 3584 CUDA-Cores. As far as I understood, each Cuda core executes one Thread, so in theory, I would be able to execute 3584 threads per cycle. Is this correct?
Another question is the one about memory. I installed and used nvprof to get some insight into the used kernels. What is displayed there is for example the number of used registers. I transfer my arrays to the GPU using cuda.to_device (in Python Numba) and as far as I understood, the arrays then reside in global memory. How do I find out how big this global memory is? Is it equivalent to the DRAM size?
Thanks in advance
I'll focus on the first part of the question. The second should really be its own separate question.
CUDA cores do not map 1-to-1 to threads. They are more like ports in a multiscalar CPU. Multiple threads can issue instructions to the same CUDA core in different clock cycles. Sort of like hyperthreading in a CPU.
You can see the relation and numbers in the documentation in chapter K Compute Capabilities compared to the table Table 3. Throughput of Native Arithmetic Instructions. Depending on your architecture, you may have for example for your card (compute capability 6.1) 2048 threads per SM and 128 32 bit floating point operations per clock cycle. That means you have 128 CUDA cores shared by a maximum of 2048 threads.
Within one GPU generation, the absolute number of threads and CUDA cores only scales with the number of multiprocessors (SMs). TechPowerup's excellent GPU database documents 28 SMs for your card which should give you 28 * 2048 threads unless I did something wrong.