
NSIGHT: What do the red and black colours mean in kernel-level experiments?


I am trying to learn NSIGHT.

Can someone tell me what the red marks indicate in the following screenshot, taken from the User Guide? There are two red marks in the "Occupancy per SM" section and two in the "Warps" section, as you can see.

Similarly, what do the black lines of varying length indicate?

[Screenshot from the Nsight User Guide]

Another example from the same page:

[Second screenshot from the same page]


Solution

  • Here is the basic explanation:

    • Grey bars represent the amount of each resource your particular device provides (determined by its hardware and its compute capability).
    • Black bars represent the theoretical limit your kernel can reach under your launch configuration (blocks per grid and threads per block).
    • Red dots represent the amount of each resource you are actually using.

    For instance, looking at "Active warps" in the first picture:

    • Grey: The device supports 64 concurrently active warps.
    • Black: Given your kernel's register usage, it is theoretically possible to have 64 warps resident.
    • Red: You achieve 63.56 active warps.

    In this case the grey bar is hidden underneath the black one, so you can't see it.

    In some cases the theoretical limit is greater than the device limit. This is OK. You can see examples in the second picture: "Block limit (shared memory)" and "Block limit (registers)". That makes sense if you consider that your kernel uses only a small fraction of those resources: if one block used 1 register, it would be possible to launch 65536 blocks (ignoring other factors), but your device limit is still 16. The number 128 then comes from 65536/512, that is, the registers available divided by the registers one block uses. The same applies to the shared memory row: since you use 0 bytes of shared memory per block, shared memory places no limit on the number of blocks you could launch.

    About blank spaces

    The theoretical and the achieved values are the same for all rows except "Active warps" and "Occupancy". In the first picture you really are executing 1024 threads per block, that is, 32 warps per block. For Occupancy and Active warps, I guess the achieved number is a kind of statistical measure. I think so because of the nature of the CUDA model: the threads of a warp execute simultaneously on an SM, and high-latency operations such as memory reads are hidden through almost-free warp context switches. I guess it is difficult to take an exact measure of the number of active warps in that situation. Besides the hardware, the kernel implementation also matters: branch divergence, for instance, can make one warp slower than the others.

    Extended information

    As you saw, these numbers are closely tied to your device's specific hardware and compute capability, so perhaps a concrete example will help here:

    A device with compute capability 3.0 can handle a maximum of 2048 threads per SM, 16 blocks per SM and 64 warps per SM. There is also a maximum number of registers available (65536 in that case).

    This Wikipedia entry is a handy reference for the features of each compute capability.

    You can query these parameters with the deviceQuery sample code provided with the CUDA toolkit or, at run time, through the CUDA API, as here.

    Performance considerations

    Ideally, 16 blocks of 128 threads can run concurrently on the SM if the kernel uses no more than 32 registers per thread (65536 registers / 2048 threads). That means a high occupancy rate. In most cases your kernel needs more than 32 registers per thread, so it is no longer possible to execute 16 blocks concurrently on the SM. The reduction happens at block granularity, i.e., by decreasing the number of resident blocks. And this is what the bars capture.

    You can experiment with the number of threads and blocks, or with the __launch_bounds__ qualifier, to optimize your kernel. You can also use the --maxrregcount compiler option to cap the number of registers per thread and see whether that improves overall execution speed.