Tags: optimization, cuda, nsight-compute

Terminology used in Nsight Compute


Two questions:

  1. According to Nsight Compute, my kernel is compute bound. The SM % of utilization relative to peak performance is 74% and the memory utilization is 47%. However, when I look at each pipeline utilization percentage, LSU utilization is way higher than others (75% vs 10-15%). Wouldn't that be an indication that my kernel is memory bound? If the utilization of compute and memory resources doesn't correspond to pipeline utilization, I don't know how to interpret those terms.

  2. The schedulers are only issuing every 4 cycles; wouldn't that mean that my kernel is latency bound? People usually define latency bound in terms of the utilization of compute and memory resources. What is the relationship between the two?


Solution

  • In Nsight Compute on CC7.5 GPUs

    SM% is defined by sm__throughput, and Memory% is defined by gpu__compute_memory_throughput.

    sm__throughput is the MAX of the following metrics:

    • sm__instruction_throughput
      • sm__inst_executed
      • sm__issue_active
      • sm__mio_inst_issued
      • sm__pipe_alu_cycles_active
      • sm__inst_executed_pipe_cbu_pred_on_any
      • sm__pipe_fp64_cycles_active
      • sm__pipe_tensor_cycles_active
      • sm__inst_executed_pipe_xu
      • sm__pipe_fma_cycles_active
      • sm__inst_executed_pipe_fp16
      • sm__pipe_shared_cycles_active
      • sm__inst_executed_pipe_uniform
      • sm__instruction_throughput_internal_activity
    • sm__memory_throughput
      • idc__request_cycles_active
      • sm__inst_executed_pipe_adu
      • sm__inst_executed_pipe_ipa
      • sm__inst_executed_pipe_lsu
      • sm__inst_executed_pipe_tex
      • sm__mio_pq_read_cycles_active
      • sm__mio_pq_write_cycles_active
      • sm__mio2rf_writeback_active
      • sm__memory_throughput_internal_activity

    gpu__compute_memory_throughput is the MAX of the following metrics:

    • gpu__compute_memory_access_throughput
      • l1tex__data_bank_reads
      • l1tex__data_bank_writes
      • l1tex__data_pipe_lsu_wavefronts
      • l1tex__data_pipe_tex_wavefronts
      • l1tex__f_wavefronts
      • lts__d_atomic_input_cycles_active
      • lts__d_sectors
      • lts__t_sectors
      • lts__t_tag_requests
      • gpu__compute_memory_access_throughput_internal_activity
    • gpu__compute_memory_request_throughput
      • l1tex__lsuin_requests
      • l1tex__texin_sm2tex_req_cycles_active
      • l1tex__lsu_writeback_active
      • l1tex__tex_writeback_active
      • l1tex__m_l1tex2xbar_req_cycles_active
      • l1tex__m_xbar2l1tex_read_sectors
      • lts__lts2xbar_cycles_active
      • lts__xbar2lts_cycles_active
      • lts__d_sectors_fill_device
      • lts__d_sectors_fill_sysmem
      • gpu__dram_throughput
      • gpu__compute_memory_request_throughput_internal_activity

    In your case the limiter is sm__inst_executed_pipe_lsu, which is an instruction throughput and therefore counts toward SM%, not Memory%. If you review sections/SpeedOfLight.py, latency bound is defined as having both sm__throughput and gpu__compute_memory_throughput below 60%.

    Some instruction pipelines, such as fp64, xu, and lsu, have lower peak throughput than others (this varies by chip). Pipeline utilization is part of sm__throughput. To improve performance, the options are:

    1. Reduce instructions to the oversubscribed pipeline, or
    2. Issue instructions of different type to use empty issue cycles.
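
    As an illustration of option 1, vectorizing loads and stores reduces the number of instructions sent to the LSU pipe. This is a minimal sketch (the kernel names are invented), assuming the element count is a multiple of 4 and the buffers are 16-byte aligned:

    ```cuda
    // Sketch only: assumes n is a multiple of 4 and the pointers
    // are 16-byte aligned.

    // Baseline: one LSU load and one LSU store instruction per float.
    __global__ void copy_scalar(const float* in, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];
    }

    // Vectorized: each load/store moves four floats (16 bytes), so the
    // kernel issues roughly 4x fewer LSU instructions for the same data.
    __global__ void copy_vec4(const float4* in, float4* out, int n4)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n4)
            out[i] = in[i];
    }
    ```

    Profiling both kernels and comparing sm__inst_executed_pipe_lsu should show the reduction.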

    GENERATING THE BREAKDOWN

    As of Nsight Compute 2020.1 there is no simple command-line option to generate the list without running a profiling session. For now you can collect one throughput metric using breakdown:<throughput metric>.avg.pct_of_peak_sustained_elapsed and parse the output to get the sub-metric names.

    For example:

    ncu.exe --csv --metrics breakdown:sm__throughput.avg.pct_of_peak_sustained_elapsed --details-all -c 1 cuda_application.exe
    

    generates:

    "ID","Process ID","Process Name","Host Name","Kernel Name","Kernel Time","Context","Stream","Section Name","Metric Name","Metric Unit","Metric Value"
    "0","33396","cuda_application.exe","127.0.0.1","kernel()","2020-Aug-20 13:26:26","1","7","Command line profiler metrics","gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed","%","0.38"
    "0","33396","cuda_application.exe","127.0.0.1","kernel()","2020-Aug-20 13:26:26","1","7","Command line profiler metrics","l1tex__data_bank_reads.avg.pct_of_peak_sustained_elapsed","%","0.05"
    "0","33396","cuda_application.exe","127.0.0.1","kernel()","2020-Aug-20 13:26:26","1","7","Command line profiler metrics","l1tex__data_bank_writes.avg.pct_of_peak_sustained_elapsed","%","0.05"
    ...
    

    The keyword breakdown can be used in Nsight Compute section files to expand a throughput metric. This is used in the SpeedOfLight.section.
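
    Section files are protobuf text files. As an approximate, untested sketch (the identifier and labels here are invented, and the exact schema may differ between Nsight Compute versions), a breakdown-prefixed metric in a custom section might look like:

    ```
    Identifier: "ThroughputBreakdown"  # hypothetical custom section
    DisplayName: "Throughput Breakdown"
    Header {
      Metrics {
        Label: "Compute (SM) Throughput"
        Name: "breakdown:sm__throughput.avg.pct_of_peak_sustained_elapsed"
      }
    }
    ```

    Compare against the .section files shipped in your Nsight Compute installation's sections directory for the authoritative format.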