Tags: visual-studio-2010, cuda, profiler, nsight

CUDA Performance Profiling with Nvidia NSight in VS2010 - .nvreport report file


I ran a trace of my application.

In this report file:

1.

When I select "CUDA -> CUDA Summary" in the drop-down, the table shows:

  • Runtime API Calls: % Time = 80.66
  • Launches: % Device Time = 15.46

All the other time percentages are nearly 0%.

My question: where are the remaining 19.34% of Time and 84.54% of Device Time? Or are these percentages taken relative to completely different 'Total Time' values?

2.

I used thrust vectors to copy my data back and forth. In the "Memory Copy" section of this report, all the % Time values for memory copies in my run appear negligible.

However, when I click the 'summary' link of the Runtime API Calls item (whose % Time value is 80.66), I immediately see the culprit: 'cudaMemcpy', with a 'Capture Time %' value of 73.75 on the 'Runtime API Calls Summary' page.

My questions are:

  • Does this mean that my bottleneck is still those calls to thrust::copy(), even though the "Memory Copy" section of the report doesn't show it?
  • How can I find the exact function call that is the most expensive overall?
  • How does the timeline feature help with any of this?

Solution

  • CUDA SUMMARY

    In the CUDA Summary, the % Time under Runtime API Calls is the percentage of CPU time spent in the CUDA Runtime. I do not recall whether this percentage is capped at 100% (all CPU threads flattened) or whether the maximum is NumCpuCores * 100%.

    API CALLS

    To find the most expensive Runtime API calls, perform the following steps:

    1. Navigate to the CUDA Runtime API Calls page
    2. Click the Duration column header twice to sort in descending order
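The relationship between the per-call page (sorted by Duration) and the per-function summary (such as the 'Capture Time %' for cudaMemcpy) can be illustrated with a small sketch. This is not Nsight's implementation; `ApiCall`, `ApiSummary`, and `summarize` are hypothetical names, and the sketch assumes you already have a list of call names and CPU-side durations in some form.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <vector>

// One traced API call: function name and CPU-side duration in microseconds.
// (Hypothetical record layout, not the .nvreport format.)
struct ApiCall {
    std::string name;
    double duration_us;
};

// Per-function totals, comparable to one row of a summary table.
struct ApiSummary {
    std::string name;
    int calls = 0;
    double total_us = 0.0;
    double percent_of_capture = 0.0;  // total_us as a percentage of capture_us
};

// Groups calls by function name and returns the rows sorted by total
// duration, longest first.
std::vector<ApiSummary> summarize(const std::vector<ApiCall>& calls,
                                  double capture_us) {
    assert(capture_us > 0.0);

    std::map<std::string, ApiSummary> by_name;
    for (const ApiCall& call : calls) {
        ApiSummary& row = by_name[call.name];
        row.name = call.name;
        row.calls += 1;
        row.total_us += call.duration_us;
    }

    std::vector<ApiSummary> rows;
    for (const auto& entry : by_name) {
        ApiSummary row = entry.second;
        row.percent_of_capture = 100.0 * row.total_us / capture_us;
        rows.push_back(row);
    }

    std::sort(rows.begin(), rows.end(),
              [](const ApiSummary& a, const ApiSummary& b) {
                  return a.total_us > b.total_us;
              });
    return rows;
}
```

The point of the sketch is that a function can dominate the summary either through a few long calls or through many short ones, which is why sorting individual calls by Duration and reading the per-function summary can both be useful.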

    It is possible to capture the call stack for CUDA Runtime API calls so you can jump to the source code from the report. This can be enabled in the Activity with the following steps:

    1. Navigate to Trace Settings in the Activity
    2. Enable System Trace
    3. Expand the CUDA Trace Settings
    4. Enable Runtime API Trace and set Call Stack Trace to Always

    WARNING: Setting Call Stack Trace to Always increases the API call overhead. Only enable this when the program is CPU limited and you are trying to identify the source code generating the API calls.

    The call stack trace can be accessed from a report page that references the API call by using the correlation pane in the bottom-left corner of the report page. The screenshot below shows the call stack for the cudaEventSynchronize call in the CUDA Runtime API Calls report page.

    [Screenshot: Nsight VSE CUDA Runtime API Calls Report Page]

    It is possible to query for the longest API calls in the Timeline report page using the correlation information for the Process\Thread\Function Calls or Process\CUDA\CUDA Context\Runtime API rows.

    1. Click on the row containing the API Calls
    2. In the correlation tree click on Row Information\Runtime API
    3. In the table of API calls click 2 times on the Duration column and scroll the table to the top.
    4. Click on the API call to navigate the timeline view to the API call.

    The call stack can also be retrieved at this point using the correlation pane.
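As a rough cross-check outside the profiler, a host-side wall-clock timer can be wrapped around a suspect call. The sketch below is an assumption-laden illustration, not part of Nsight: `time_call_ms` is a hypothetical helper, and it measures only how long the CPU thread spent inside the call, which is the same kind of quantity the Runtime API trace reports as Duration.

```cpp
#include <chrono>
#include <utility>

// Runs `fn` once and returns the elapsed host wall-clock time in milliseconds.
// Hypothetical helper for illustration; it measures CPU-side time only.
// Asynchronous GPU work (for example a kernel launch) returns before the GPU
// has finished, so for such calls this reflects only the time the host spent
// in the call.
template <typename Fn>
double time_call_ms(Fn&& fn) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<Fn>(fn)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Possible usage, assuming existing thrust vectors h_vec and d_vec
// (hypothetical names):
//   double ms = time_call_ms([&] {
//       thrust::copy(h_vec.begin(), h_vec.end(), d_vec.begin());
//   });
```

Timing by hand like this adds almost no overhead, but it only covers the calls you wrap, whereas the call stack trace locates expensive calls you did not already suspect.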
