I am trying to get some insight into why my CUDA kernel performs relatively poorly, and I am hoping to get some answers from the NVIDIA profiler.
My CUDA program is a 'boiled down' version of a larger application, isolating and exercising the kernel in question. The program launches the kernel several times in order to measure its execution time as a mean over multiple launches. After the timing loop, a memory copy from device to host is issued to make sure all kernel calls have finished. The program is written in CUDA C++.
This is how I built the program:
main.o: main.cu
	nvcc -res-usage -arch=sm_61 -c $<
main: main.o stopwatch.o
	g++ -o $@ $^ -lcudart -L/usr/local/cuda-11.0/lib64
This test was done on a PC with Intel CPU and an NVIDIA GeForce GTX 1070. The OS is Ubuntu 20.04 with a freshly installed CUDA 11 from the NVIDIA website along with driver 450.51.06:
nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070    On   | 00000000:01:00.0  On |                  N/A |
| 28%   38C    P8     8W / 151W |    317MiB /  8111MiB |      3%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
The following command was used to generate the profiling file:
sudo /usr/local/cuda-11.0/bin/nvprof -o main.nvvp --profile-from-start off ./main
I also tried profiling from start, but that leads to the same issue described below.
The following command was used to launch the visual profiler:
nvvp -vm /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java main.nvvp
The Visual Profiler walks me through several steps, and when it reaches "Perform Kernel Analysis" it reports:
Insufficient kernel bounds data. The data needed to calculate compute, memory, and latency bounds for the kernel could not be collected
Is this sort of detailed profiling not available on my GPU (perhaps because it is a consumer gaming card)?
By default, nvprof captures only a small amount of information in the output file it generates. This is enough to generate an application timeline when the output file is imported into nvvp, but not enough to enable all of the different capabilities of nvvp.
According to the documentation, the --analysis-metrics switch for nvprof is recommended for this type of use.
--analysis-metrics is referred to about 6 different times in the profiler documentation, so you may simply want to search on it to see all of the references or recommendations for its use.
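As a rough sketch, the profiling command from the question could be re-run with the switch added. The nvprof path is the one from the question; the output file name main-analysis.nvvp is just a name I picked for illustration, and the guard around the call is only there so the snippet does nothing harmful on a machine without the toolkit or the binary:

```shell
# Path taken from the question; adjust to your CUDA installation.
NVPROF=/usr/local/cuda-11.0/bin/nvprof
# Hypothetical output file name chosen for this sketch.
OUT=main-analysis.nvvp

if [ -x "$NVPROF" ] && [ -x ./main ]; then
    # Same invocation as in the question, plus --analysis-metrics so that
    # the data needed by the nvvp analysis stages is collected.
    sudo "$NVPROF" --analysis-metrics -o "$OUT" ./main
else
    echo "would run: $NVPROF --analysis-metrics -o $OUT ./main"
fi
```

The resulting file can then be opened in nvvp the same way as before. Expect the run to take noticeably longer than a plain timeline capture, since collecting the metrics typically requires replaying the kernels.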
Note that --analysis-metrics can capture a large amount of information. For a large, complex application, it may substantially increase the time the profilers spend processing data. Therefore, if you know specifically which data you are looking for, you may wish to specify individual metrics instead. Without --analysis-metrics, however, various nvvp analysis tools may not work correctly when you import the file.