Meaning of the output from CUPTI PC Sampling API

I am running the CUPTI PC sampling sample pc_sampling_start_stop.cu, and some of the terms in its output are:

pcOffset : 0x90
range id: 2
total samples: 387
dropped samples: 0
non user kernels total samples: 383

Although I have specified #define NUM_PC_COLLECT 100 in the code, why do I get total samples: 387 instead of 100?

Can someone please explain what these terms mean? Or are there any resources that explain them? How can I use the stallReasons to pinpoint problems in my kernel? The CUPTI user guide doesn't explain the terms in depth or how to use this knowledge to pinpoint problems in your code/kernel.

Solution

NVIDIA GPUs support an SM Program Counter sampler. The SM sampler can be programmed to sample every 2^(5+N) active cycles. On each sample period the sampler picks an active warp in round-robin order and outputs the warp's Program Counter (the address of the next instruction to issue), the warp state (stall reason), and whether the selected warp scheduler issued an instruction on the selection cycle.
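As a rough illustration of the 2^(5+N) relation, the sketch below computes the sampling period for a given N and an approximate per-SM sample count. The function names are made up for this example and are not part of CUPTI, and the one-sample-per-period-per-SM estimate is a simplifying assumption.

```cpp
#include <cassert>
#include <cstdint>

// Sampling period in active SM cycles for exponent N,
// following the 2^(5+N) relation described above.
std::uint64_t samplingPeriodCycles(unsigned n) {
    assert(n + 5 < 64 && "exponent too large for a 64-bit period");
    return std::uint64_t{1} << (5 + n);
}

// Rough estimate of how many samples one SM produces while it is
// active for `activeCycles` cycles (assumes one sample per period).
std::uint64_t estimatedSamplesPerSm(std::uint64_t activeCycles, unsigned n) {
    return activeCycles / samplingPeriodCycles(n);
}
```

With N = 0 the period is 32 active cycles, so an SM that stays active for 1024 cycles would yield about 32 samples under this assumption. A larger N lowers the sample count accordingly.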

CUPTI allows the user to define the memory buffer for the output data. The data is in the form

PC : <counter0> <counter1> ... <counterN>

where each counter holds the sample count for one selected warp stall reason.

NUM_PC_COLLECT defines the number of rows, that is, the number of PC records the buffer can hold, not the number of samples. The size of each row is defined by the number of enabled counters. The number of samples depends on the sampling period, SM frequency, number of active SMs, etc.
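To make the row/sample distinction concrete, here is a minimal sketch of how individual samples could collapse into such a table. RawSample, PcTable, aggregate and totalCount are hypothetical names for illustration only; the real aggregation happens inside CUPTI and may work differently.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical raw sample: a PC plus the index of the observed stall reason.
struct RawSample {
    std::uint64_t pc;
    std::size_t stallReasonIndex;
};

// One row per unique PC, one column per enabled stall-reason counter.
using PcTable = std::map<std::uint64_t, std::vector<std::uint32_t>>;

PcTable aggregate(const std::vector<RawSample>& samples, std::size_t numCounters) {
    PcTable table;
    for (const RawSample& s : samples) {
        if (s.stallReasonIndex >= numCounters) continue;  // counter not enabled
        std::vector<std::uint32_t>& row = table[s.pc];
        if (row.empty()) row.resize(numCounters, 0);      // first sample for this PC
        ++row[s.stallReasonIndex];
    }
    return table;
}

// Sum of every counter in every row: the number of samples in the table.
std::uint64_t totalCount(const PcTable& table) {
    std::uint64_t total = 0;
    for (const auto& entry : table)
        for (std::uint32_t c : entry.second) total += c;
    return total;
}
```

Four samples that hit only two distinct PCs produce two rows whose counters sum to four, which is why a sample total can exceed the number of rows.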

CUpti_PCSamplingPCData contains the data for each unique instruction. The PC or Program Counter is the address of the next instruction to execute. During execution the PC is represented as a 64-bit virtual address. CUPTI converts this to an offset within a function to make it easier to aggregate information from multiple executions of the CUDA application. The CUpti_PCSamplingPCData struct contains fields that identify the function and its attributes.

The CUpti_PCSamplingData documentation describes each of the remaining fields:

uint64_t CUpti_PCSamplingData::rangeId

Unique identifier for each range. Data collected across multiple ranges in multiple buffers can be identified using range id.

uint64_t CUpti_PCSamplingData::totalSamples

Number of samples collected across all PCs. It includes samples for user modules, samples for non-user kernels and dropped samples. It includes counts for all non-selected stall reasons. CUPTI does not provide PC records for non-user kernels. CUPTI does not provide PC records for instructions for which all selected stall reason metric counts are zero.

uint64_t CUpti_PCSamplingData::droppedSamples

Number of samples that were dropped by hardware due to backpressure/overflow.

uint64_t CUpti_PCSamplingData::nonUsrKernelsTotalSamples

Number of samples collected across all non-user kernel PCs. It includes counts for all non-selected stall reasons as well. CUPTI does not provide PC records for non-user kernels.
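Applied to the output in the question, these definitions suggest a simple accounting. The sketch below assumes that user-kernel, non-user-kernel and dropped samples are disjoint and together make up totalSamples. SampleTotals is a simplified stand-in for illustration, not the real CUPTI struct.

```cpp
#include <cassert>
#include <cstdint>

// Simplified stand-in holding only the three counters discussed above.
struct SampleTotals {
    std::uint64_t totalSamples;
    std::uint64_t droppedSamples;
    std::uint64_t nonUsrKernelsTotalSamples;
};

// Samples attributable to user kernels, assuming the three categories
// (user, non-user, dropped) do not overlap.
std::uint64_t userKernelSamples(const SampleTotals& t) {
    assert(t.totalSamples >= t.droppedSamples + t.nonUsrKernelsTotalSamples);
    return t.totalSamples - t.droppedSamples - t.nonUsrKernelsTotalSamples;
}
```

With the values from the question (387 total, 0 dropped, 383 non-user), this leaves 4 samples for the user kernel. Of those, only the ones with a nonzero count for a selected stall reason show up as PC records.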

For more information on CUPTI PC Sampling, please see the CUPTI PC Sampling API documentation.

How can I use the stallReasons to pinpoint problems in my kernel? The CUPTI user guide doesn't explain the terms in depth or how to use this knowledge to pinpoint problems in your code/kernel.

I would recommend that you run your program through Nsight Compute to see how the CUPTI information can be used. Beyond collecting the data, there is additional work to:

disassemble the kernel
annotate the disassembly with CUpti_PCSamplingPCData data
correlate the disassembly to source code via line tables
roll the per instruction counters up to source line counters
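The last step in the list above could look something like the following sketch. It assumes you already have a line table that maps the starting pcOffset of each instruction range to a source line. LineTable, PcSamples and rollUpToLines are hypothetical names, and a real line table would come from the cubin's debug information rather than being written by hand.

```cpp
#include <cstdint>
#include <map>

// Hypothetical line table: start offset of an instruction range -> source line.
using LineTable = std::map<std::uint64_t, int>;
// Per-instruction sample counts keyed by pcOffset.
using PcSamples = std::map<std::uint64_t, std::uint64_t>;

// Roll per-instruction counters up to per-source-line counters.
std::map<int, std::uint64_t> rollUpToLines(const PcSamples& pcs,
                                           const LineTable& lines) {
    std::map<int, std::uint64_t> perLine;
    for (const auto& entry : pcs) {
        // Find the last line-table entry whose start offset is <= this PC.
        auto it = lines.upper_bound(entry.first);
        if (it == lines.begin()) continue;  // PC precedes the table: no line info
        --it;
        perLine[it->second] += entry.second;
    }
    return perLine;
}
```

For example, with line-table entries at offsets 0x00, 0x40 and 0x90, samples at 0x40 and 0x50 would both be charged to the line that starts at 0x40. The lines with the highest counts are the natural starting points when looking for stalls.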

For more information on sampling and warp state/stall reasons, refer to the Sampling section of the Nsight Compute Kernel Profiling Guide.