Tags: performance, cuda, profiling, nsight

How to get NSight Kernel Launch Timing from a Loop


*This is a more specific, better-formed version of a question I already asked. I deleted the other one.

So I'm trying to collect kernel timing data from a CUDA library...

The library has benchmarks for different types for each of its algorithms and they work like this:

There is a 2D array holding pairs of array size and test iteration count. Example:

const int Tests[][2] = {
    { 10000, 10000 },
    { 50000, 5000 },
    { 100000, 5000 },
    { 200000, 2000 }
    // ...
};

Then in main there will be a loop

// get context ptr
for(int test = 0; test < numTests; ++test)
    BenchmarkMyAlg(Tests[test][0], Tests[test][1], *context);

BenchmarkMyAlg sets up the data, then runs the kernel in a loop (Tests[test][1] times).

What I want to do is to get the "CUDA Launch Summary" data, specifically "average duration for executing the device function in microseconds," for each test parameter pair. I.e. for each iteration of that loop in main.

As it is now, I am only able to get the average timing for the entire main loop. To put it another way, I only get 1 row of NSight data after the application executes, and I want numTests rows of data.

If a 2nd, different algorithm is tested in main, NSight will make another row of data. e.g...

for(int test = 0; test < numTests; ++test)
    BenchmarkMyAlg(Tests[test][0], Tests[test][1], *context);

for(int test = 0; test < numTests; ++test)
    BenchmarkMyOtherAlg(Tests[test][0], Tests[test][1], *context);

But again, that new row of data refers to the whole loop, giving me 2 rows of data when I want 2 * numTests rows of data.

I've tried digging through settings in NSight and I've also tinkered with nvprof some, but I haven't made any progress.

I'm thinking there is a way I could re-code the file so that NSight would recognize each test iteration as a new/different kernel like it does when actually switching to a different kernel (like in my 2nd example). Perhaps initializing numTests separate references to the BenchmarkMyAlg function and then running through those? I'll go try that for now and comment back if I get anywhere.


Solution

  • With nvprof you should be able to get either the combined results (min, max, avg) or the trace for each individual invocation (using --print-gpu-trace). What you want is something in between: timings grouped per test. The tool cannot do that on its own, because your kernel has a single name and so it cannot tell the groups apart (it would have to inspect the kernel arguments, which would add a lot of overhead).

    One way to get what you want would be to post-process the full GPU trace to manually group the individual invocations - the trace is chronological so should be straightforward.
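The grouping step could look roughly like the sketch below. It assumes the per-launch durations for the kernel have already been pulled out of the trace into a vector, in launch order, and that each test launches the kernel exactly Tests[test][1] times with no extra warm-up launches. The function name AveragePerTest is made up for illustration; reading the trace file itself is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: averages kernel durations per test.
// durationsUs     - every launch duration, in chronological (launch) order
// launchesPerTest - how many launches each test performs, e.g. Tests[test][1]
std::vector<double> AveragePerTest(const std::vector<double>& durationsUs,
                                   const std::vector<std::size_t>& launchesPerTest) {
    std::vector<double> averages;
    std::size_t next = 0;  // index of the first launch belonging to the current test
    for (std::size_t launches : launchesPerTest) {
        // The trace must contain enough entries for this test.
        assert(launches > 0 && next + launches <= durationsUs.size());
        double sum = 0.0;
        for (std::size_t i = 0; i < launches; ++i) {
            sum += durationsUs[next + i];
        }
        averages.push_back(sum / static_cast<double>(launches));
        next += launches;  // the next test starts right after this group
    }
    return averages;
}
```

If the benchmark does any warm-up launches of the same kernel before the timed loop, the counts passed in would need to account for them, otherwise the groups drift out of alignment.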

    Another way would be to templatise your kernel, for example on the test number. Even if you don't actually use the template argument inside your kernel, it will force each test to have a different kernel name, which makes the default aggregation in nvprof (and NSight) do what you want.
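One wrinkle with the template approach: a template argument has to be a compile-time constant, so the runtime loop index test from main cannot be passed as the template argument directly. Below is a minimal host-side sketch of one way around that, assuming C++14. BenchmarkMyAlgTagged, MakeTable and RunTest are hypothetical names, not part of the library, and the context argument is omitted. In real code the wrapper body is where a kernel declared as template <int TestId> __global__ void ... would be launched, so that each TestId gets its own kernel name.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <utility>

// Same table as in the question.
constexpr int Tests[][2] = {
    { 10000, 10000 },
    { 50000, 5000 },
    { 100000, 5000 },
    { 200000, 2000 }
};
constexpr std::size_t numTests = sizeof(Tests) / sizeof(Tests[0]);

// Hypothetical stand-in for the benchmark. In CUDA code this would set up the
// data and launch the kernel instantiated with the same TestId.
template <int TestId>
int BenchmarkMyAlgTagged(int count, int iterations) {
    (void)count;
    (void)iterations;
    return TestId;  // returned only so the dispatch can be checked
}

using BenchFn = int (*)(int, int);

// Builds one function pointer per test index at compile time.
template <std::size_t... Ids>
constexpr std::array<BenchFn, sizeof...(Ids)> MakeTable(std::index_sequence<Ids...>) {
    return {{ &BenchmarkMyAlgTagged<static_cast<int>(Ids)>... }};
}

// Maps the runtime loop index onto the matching instantiation.
int RunTest(std::size_t test) {
    static const auto table = MakeTable(std::make_index_sequence<numTests>{});
    assert(test < numTests);
    return table[test](Tests[test][0], Tests[test][1]);
}
```

The loop in main would then call RunTest(test) instead of BenchmarkMyAlg(...). The cost is that the kernel gets compiled once per test, which is fine for a handful of tests but adds up if the table is long.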