NVIDIA Visual Profiler: Insufficient kernel bounds data

I am trying to get some insight into why my CUDA kernel performs relatively poorly, and I am hoping the NVIDIA profiler can give me some answers.

My CUDA program is a 'boiled down' version of a larger application, isolating and exercising the kernel in question. The program launches the kernel several times in order to measure its execution time as a mean over multiple launches. After the timing loop, a device-to-host memory copy is issued to make sure all kernel calls have finished. The program is written in CUDA C++.

This is how I built the program:

main.o: main.cu
    nvcc -res-usage -arch=sm_61  -c $<

main: main.o stopwatch.o
    g++ -o $@ $^ -lcudart -L/usr/local/cuda-11.0/lib64

This test was done on a PC with an Intel CPU and an NVIDIA GeForce GTX 1070. The OS is Ubuntu 20.04 with a freshly installed CUDA 11 from the NVIDIA website, along with driver 450.51.06:

nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070    On   | 00000000:01:00.0  On |                  N/A |
| 28%   38C    P8     8W / 151W |    317MiB /  8111MiB |      3%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

The following command was used to generate the profiling file:

sudo /usr/local/cuda-11.0/bin/nvprof -o main.nvvp --profile-from-start off ./main

I also tried profiling from the start, but that leads to the same issue described below.

The following command was used to launch the visual profiler:

nvvp -vm /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java main.nvvp

The Visual Profiler walks me through several steps, and when it reaches "Perform Kernel Analysis" it reports:

Insufficient kernel bounds data. The data needed to calculate compute, memory, and latency bounds for the kernel could not be collected

Is this sort of detailed profiling not available on my GPU? (maybe because it's a gamer card)

Solution

By default, nvprof captures only a small amount of information in the output file it generates. That is enough for nvvp to build an application timeline when the file is imported, but not enough to enable all of nvvp's analysis capabilities.

According to the documentation, the --analysis-metrics switch for nvprof is recommended for this type of use.

--analysis-metrics is mentioned about six times in the profiler documentation, so you may simply want to search for it to see all of the references and recommendations for its use.
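As a rough sketch of what that looks like with the paths from the question, the wrapper below builds the nvprof command line with --analysis-metrics and writes the result to main.nvvp. The variable names and the existence check are my own additions for illustration, not something nvprof requires.

```shell
#!/bin/sh
# Paths and file names are taken from the question; adjust for your setup.
NVPROF="${NVPROF:-/usr/local/cuda-11.0/bin/nvprof}"
APP="${APP:-./main}"
OUT="${OUT:-main.nvvp}"

# --analysis-metrics collects the metrics nvvp uses for its kernel analysis;
# -o writes the profile to a file that nvvp can import.
CMD="$NVPROF --analysis-metrics -o $OUT $APP"
echo "$CMD"

# Run only when both the profiler and the application are present.
# Prefix the command with sudo, as in the question, if access to the
# GPU performance counters is restricted on your system.
if [ -x "$NVPROF" ] && [ -x "$APP" ]; then
    $CMD
fi
```

The resulting main.nvvp can then be opened in nvvp the same way as before.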

Note that --analysis-metrics can capture a large amount of information. For a large, complex application, it may substantially increase the time the profiler spends collecting and processing data. Therefore, if you know exactly which data you are looking for, you may wish to request specific metrics instead. Without --analysis-metrics, however, various nvvp analysis tools may not work correctly when you import the file.
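If you go the specific-metrics route, a command along the following lines may be a starting point. The kernel name myKernel is hypothetical, and the metric list is only an illustrative selection; running nvprof --query-metrics shows which metrics your GPU actually supports.

```shell
#!/bin/sh
# Paths are taken from the question; KERNEL is a hypothetical kernel name.
NVPROF="${NVPROF:-/usr/local/cuda-11.0/bin/nvprof}"
APP="${APP:-./main}"
KERNEL="${KERNEL:-myKernel}"

# Illustrative selection of metrics; check `nvprof --query-metrics`
# for the names available on your device.
METRICS="achieved_occupancy,gld_efficiency,gst_efficiency,dram_read_throughput"

# --kernels restricts collection to the named kernel,
# --metrics selects which metrics are collected for it.
CMD="$NVPROF --kernels $KERNEL --metrics $METRICS $APP"
echo "$CMD"

# Run only when both the profiler and the application are present.
if [ -x "$NVPROF" ] && [ -x "$APP" ]; then
    $CMD
fi
```

Without -o, nvprof prints the collected values to the console, which is often enough for a quick look at a single kernel.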