Tags: cuda, nvidia, nsight

How to measure the amount of data copied in NVIDIA nsight systems?


Trivia

In NVIDIA Nsight Systems you can use the --stats=true flag to get the details for data transfer between GPU and CPU. The output includes a section similar to what follows:

CUDA Memory Operation Statistics (KiB)

     Total  Operations     Average   Minimum     Maximum  Name
----------  ----------  ----------  --------  ----------  ------------------
  8192.000           2    4096.000  4096.000    4096.000  [CUDA memcpy HtoD]
528384.000           2  264192.000  4096.000  524288.000  [CUDA memcpy DtoD]

Question

Is it possible to get these statistics per API call? That is, can we get the amount of data transferred between host and device in each of the cudaMemcpyXXX calls?


Solution

  • If you want to do this purely from the CLI, I suggest following the guidance given in this blog starting at "Extending the Summary Statistics". The basic steps are to export the profile data as a sqlite database, then formulate a database query to extract the data that you want. I acknowledge this is not a complete recipe.
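
The export-then-query idea can be sketched roughly as follows. This is not a tested recipe against a real report: it assumes the export contains the tables `CUPTI_ACTIVITY_KIND_MEMCPY` (with `start`, `bytes`, `copyKind`, `correlationId`), `CUPTI_ACTIVITY_KIND_RUNTIME` (with `nameId`, `correlationId`) and `StringIds` (with `id`, `value`). Table and column names can differ between Nsight Systems versions, so inspect the schema of your own export first. The function name `memcpy_sizes` is made up for this illustration.

```python
import sqlite3
import sys

# Assumed schema of an Nsight Systems SQLite export; verify it against your
# own file (for example with `.schema` in the sqlite3 shell) before relying on it.
QUERY = """
SELECT m.bytes, m.copyKind, s.value
FROM CUPTI_ACTIVITY_KIND_MEMCPY AS m
LEFT JOIN CUPTI_ACTIVITY_KIND_RUNTIME AS r
       ON r.correlationId = m.correlationId
LEFT JOIN StringIds AS s
       ON s.id = r.nameId
ORDER BY m.start
"""


def memcpy_sizes(con):
    """Return one (bytes, copyKind, api_name) tuple per memcpy operation.

    copyKind is returned as the raw integer stored in the export.
    """
    return con.execute(QUERY).fetchall()


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python memcpy_sizes.py <exported .sqlite file>
    con = sqlite3.connect(sys.argv[1])
    try:
        for nbytes, kind, api_name in memcpy_sizes(con):
            print(f"{api_name}: {nbytes} bytes (copyKind={kind})")
    finally:
        con.close()
```

Joining on `correlationId` alone is a simplification; with several processes in one report you would probably also want to match on the process identifier.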

    If using the GUI is acceptable, I think it is pretty straightforward. Suppose we had a very simple CUDA program:

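    int main(){

        int *d1, *d2;
        int *h1, *h2;
        h1 = new int[8192];      // 8192 ints   = 32768 bytes, assuming 4-byte int
        h2 = new int[262144];    // 262144 ints = 1048576 bytes
        cudaMalloc(&d1, 32768);
        cudaMalloc(&d2, 1048576);
        // Two host-to-device copies of different sizes, easy to tell apart in the timeline
        cudaMemcpy(d1, h1, 32768, cudaMemcpyHostToDevice);
        cudaMemcpy(d2, h2, 1048576, cudaMemcpyHostToDevice);
    }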
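
As a sanity check on what to look for, the two copies in this program are 32768 and 1048576 bytes. The small sketch below converts them to KiB, assuming the summary table reports KiB as in the output shown in the question. The variable names are only illustrative.

```python
# Byte counts passed to the two cudaMemcpy calls in the example program.
copies_bytes = [32768, 1048576]

KIB = 1024
sizes_kib = [b / KIB for b in copies_bytes]

# The same aggregates that the summary table reports for one row.
summary = {
    "total": sum(sizes_kib),
    "operations": len(sizes_kib),
    "average": sum(sizes_kib) / len(sizes_kib),
    "minimum": min(sizes_kib),
    "maximum": max(sizes_kib),
}
print(summary)
```

So the aggregated HtoD row would be expected to show a total of 1056 KiB over 2 operations, while the per-operation view should show 32768 bytes and 1048576 bytes for the individual copies.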
    

    These are the steps:

    1. You could either do interactive profiling directly from the GUI as covered here or you could start with the CLI. To start with the CLI, run a command like this:

      nsys profile --trace=cuda ./my_app
      

      Among other things, this will create a report file named reportX.qdrep, where X is a number such as 1, 2, or 3.

    2. Open up the GUI, and File...Open the above reportX.qdrep file. In this case, the GUI need not be on the same machine, but it should be of a version greater than or equal to the CLI version used to create the report file.

    3. Fully expand all the rows in the timeline pertaining to the CUDA activities.

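    4. Hover your mouse over the operation of interest. The tooltip that appears for a memcpy operation should show the details of that individual transfer, including the amount of data copied.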