The number shown in the square brackets after the kernel name correlates to the CUDA API that launched that kernel. (from GPU-Trace and API-Trace Modes)
So what exactly do [94] (and the other bracketed numbers) refer to in the NVIDIA CUDA Runtime API?
==27706== Profiling application: matrixMul
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "GeForce GT 640M LE" with compute capability 3.0
MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
Performance= 35.36 GFlop/s, Time= 3.707 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: OK
Note: For peak performance, please refer to the matrixMulCUBLAS example.
==27706== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput Device Context Stream Name
133.81ms 135.78us - - - - - 409.60KB 3.0167GB/s GeForce GT 640M 1 2 [CUDA memcpy HtoD]
134.62ms 270.66us - - - - - 819.20KB 3.0267GB/s GeForce GT 640M 1 2 [CUDA memcpy HtoD]
134.90ms 3.7037ms (20 10 1) (32 32 1) 29 8.1920KB 0B - - GeForce GT 640M 1 2 void matrixMulCUDA<int=32>(float*, float*, float*, int, int) [94]
138.71ms 3.7011ms (20 10 1) (32 32 1) 29 8.1920KB 0B - - GeForce GT 640M 1 2 void matrixMulCUDA<int=32>(float*, float*, float*, int, int) [105]
<...more output...>
1.24341s 3.7011ms (20 10 1) (32 32 1) 29 8.1920KB 0B - - GeForce GT 640M 1 2 void matrixMulCUDA<int=32>(float*, float*, float*, int, int) [2191]
1.24711s 3.7046ms (20 10 1) (32 32 1) 29 8.1920KB 0B - - GeForce GT 640M 1 2 void matrixMulCUDA<int=32>(float*, float*, float*, int, int) [2198]
1.25089s 248.13us - - - - - 819.20KB 3.3015GB/s GeForce GT 640M 1 2 [CUDA memcpy DtoH]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
It might be clearer if it said:
The number shown in the square brackets after the kernel name correlates to the CUDA API call that launched that kernel.
If you run a given code using the --print-api-trace option, you'll get a sequential list of all the CUDA API calls issued by that application. If you were to number those in order, the number associated with a particular kernel launch would be shown in the square brackets in the --print-gpu-trace output.
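Here is a fully worked example. Note the correlation between [105], [106], and [108] in the api-trace output and in the gpu-trace output. Counting the API calls in order, the three kernel launches are calls 105, 106, and 108; call 107 is the cudaDeviceSynchronize between them, which launches no kernel and so has no entry in the gpu trace: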
$ cat t1.cu
__global__ void k(){}
int main(){
  k<<<1,1>>>();
  k<<<1,1>>>();
  cudaDeviceSynchronize();
  k<<<1,1>>>();
  cudaDeviceSynchronize();
}
$ nvcc -o t1 t1.cu
$ nvprof --print-api-trace ./t1
==7206== NVPROF is profiling process 7206, command: ./t1
==7206== Profiling application: ./t1
==7206== Profiling result:
Start Duration Name
116.17ms 3.0990us cuDeviceGetPCIBusId
130.20ms 800ns cuDeviceGetCount
130.20ms 251ns cuDeviceGetCount
130.41ms 1.0500us cuDeviceGet
130.41ms 705ns cuDeviceGetAttribute
130.42ms 539ns cuDeviceGetAttribute
130.42ms 547ns cuDeviceGetAttribute
130.46ms 525ns cuDeviceGetCount
130.46ms 277ns cuDeviceGet
130.46ms 59.680us cuDeviceGetName
130.52ms 63.802us cuDeviceTotalMem
130.59ms 497ns cuDeviceGetAttribute
130.59ms 226ns cuDeviceGetAttribute
130.59ms 282ns cuDeviceGetAttribute
130.59ms 234ns cuDeviceGetAttribute
130.59ms 229ns cuDeviceGetAttribute
130.59ms 34.628us cuDeviceGetAttribute
130.62ms 372ns cuDeviceGetAttribute
130.63ms 220ns cuDeviceGetAttribute
130.63ms 284ns cuDeviceGetAttribute
130.63ms 237ns cuDeviceGetAttribute
130.63ms 222ns cuDeviceGetAttribute
130.63ms 231ns cuDeviceGetAttribute
130.63ms 288ns cuDeviceGetAttribute
130.63ms 219ns cuDeviceGetAttribute
130.63ms 3.1870us cuDeviceGetAttribute
130.63ms 211ns cuDeviceGetAttribute
130.63ms 211ns cuDeviceGetAttribute
130.63ms 211ns cuDeviceGetAttribute
130.63ms 275ns cuDeviceGetAttribute
130.63ms 211ns cuDeviceGetAttribute
130.63ms 213ns cuDeviceGetAttribute
130.64ms 211ns cuDeviceGetAttribute
130.64ms 211ns cuDeviceGetAttribute
130.64ms 336ns cuDeviceGetAttribute
130.64ms 210ns cuDeviceGetAttribute
130.64ms 211ns cuDeviceGetAttribute
130.64ms 214ns cuDeviceGetAttribute
130.64ms 214ns cuDeviceGetAttribute
130.64ms 210ns cuDeviceGetAttribute
130.64ms 216ns cuDeviceGetAttribute
130.64ms 212ns cuDeviceGetAttribute
130.64ms 211ns cuDeviceGetAttribute
130.64ms 211ns cuDeviceGetAttribute
130.64ms 211ns cuDeviceGetAttribute
130.64ms 216ns cuDeviceGetAttribute
130.64ms 212ns cuDeviceGetAttribute
130.64ms 212ns cuDeviceGetAttribute
130.64ms 212ns cuDeviceGetAttribute
130.64ms 214ns cuDeviceGetAttribute
130.64ms 211ns cuDeviceGetAttribute
130.64ms 211ns cuDeviceGetAttribute
130.64ms 211ns cuDeviceGetAttribute
130.65ms 211ns cuDeviceGetAttribute
130.65ms 211ns cuDeviceGetAttribute
130.65ms 211ns cuDeviceGetAttribute
130.65ms 211ns cuDeviceGetAttribute
130.65ms 211ns cuDeviceGetAttribute
130.65ms 211ns cuDeviceGetAttribute
130.65ms 213ns cuDeviceGetAttribute
130.65ms 212ns cuDeviceGetAttribute
130.65ms 211ns cuDeviceGetAttribute
130.65ms 211ns cuDeviceGetAttribute
130.65ms 210ns cuDeviceGetAttribute
130.65ms 215ns cuDeviceGetAttribute
130.65ms 212ns cuDeviceGetAttribute
130.65ms 320.65us cuDeviceGetAttribute
130.97ms 322ns cuDeviceGetAttribute
130.97ms 206ns cuDeviceGetAttribute
130.97ms 218ns cuDeviceGetAttribute
130.97ms 212ns cuDeviceGetAttribute
130.97ms 212ns cuDeviceGetAttribute
130.98ms 226ns cuDeviceGetAttribute
130.98ms 220ns cuDeviceGetAttribute
130.98ms 212ns cuDeviceGetAttribute
130.98ms 210ns cuDeviceGetAttribute
130.98ms 206ns cuDeviceGetAttribute
130.98ms 207ns cuDeviceGetAttribute
130.98ms 209ns cuDeviceGetAttribute
130.98ms 211ns cuDeviceGetAttribute
130.98ms 208ns cuDeviceGetAttribute
130.98ms 208ns cuDeviceGetAttribute
130.98ms 229ns cuDeviceGetAttribute
130.98ms 215ns cuDeviceGetAttribute
130.98ms 216ns cuDeviceGetAttribute
130.98ms 209ns cuDeviceGetAttribute
130.98ms 316.59us cuDeviceGetAttribute
131.30ms 266ns cuDeviceGetAttribute
131.30ms 252ns cuDeviceGetAttribute
131.30ms 212ns cuDeviceGetAttribute
131.30ms 235ns cuDeviceGetAttribute
131.30ms 209ns cuDeviceGetAttribute
131.30ms 272ns cuDeviceGetAttribute
131.30ms 207ns cuDeviceGetAttribute
131.30ms 735ns cuDeviceGetAttribute
131.30ms 254ns cuDeviceGetAttribute
131.30ms 208ns cuDeviceGetAttribute
131.30ms 208ns cuDeviceGetAttribute
131.30ms 610ns cuDeviceGetAttribute
131.31ms 273ns cuDeviceGetAttribute
131.31ms 412ns cuDeviceGetAttribute
131.31ms 216ns cuDeviceGetAttribute
131.31ms 211ns cuDeviceGetAttribute
131.31ms 205ns cuDeviceGetAttribute
131.31ms 59.911ms cudaLaunchKernel (k(void) [105])
191.23ms 11.222us cudaLaunchKernel (k(void) [106])
191.24ms 5.7860us cudaDeviceSynchronize
191.25ms 9.2890us cudaLaunchKernel (k(void) [108])
191.26ms 5.1790us cudaDeviceSynchronize
$ nvprof --print-gpu-trace ./t1
==7224== NVPROF is profiling process 7224, command: ./t1
==7224== Profiling application: ./t1
==7224== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Device Context Stream Name
191.20ms 1.6000us (1 1 1) (1 1 1) 8 0B 0B Quadro K2000 (0) 1 7 k(void) [105]
191.22ms 896ns (1 1 1) (1 1 1) 8 0B 0B Quadro K2000 (0) 1 7 k(void) [106]
191.23ms 928ns (1 1 1) (1 1 1) 8 0B 0B Quadro K2000 (0) 1 7 k(void) [108]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
$