Tags: optimization, cuda, nsight-compute

Terminology used in Nsight Compute


Two questions:

  1. According to Nsight Compute, my kernel is compute bound. The SM % of utilization relative to peak performance is 74% and the memory utilization is 47%. However, when I look at each pipeline utilization percentage, LSU utilization is way higher than others (75% vs 10-15%). Wouldn't that be an indication that my kernel is memory bound? If the utilization of compute and memory resources doesn't correspond to pipeline utilization, I don't know how to interpret those terms.

  2. The schedulers are only issuing every 4 cycles; wouldn't that mean that my kernel is latency bound? People usually define latency bound in terms of the utilization of compute and memory resources. What is the relationship between the two?


Solution

  • In Nsight Compute on CC7.5 GPUs

    SM% is defined by sm__throughput, and Memory% is defined by gpu__compute_memory_throughput.

    sm__throughput is the MAX of the following metrics:

    • sm__instruction_throughput
      • sm__inst_executed
      • sm__issue_active
      • sm__mio_inst_issued
      • sm__pipe_alu_cycles_active
      • sm__inst_executed_pipe_cbu_pred_on_any
      • sm__pipe_fp64_cycles_active
      • sm__pipe_tensor_cycles_active
      • sm__inst_executed_pipe_xu
      • sm__pipe_fma_cycles_active
      • sm__inst_executed_pipe_fp16
      • sm__pipe_shared_cycles_active
      • sm__inst_executed_pipe_uniform
      • sm__instruction_throughput_internal_activity
    • sm__memory_throughput
      • idc__request_cycles_active
      • sm__inst_executed_pipe_adu
      • sm__inst_executed_pipe_ipa
      • sm__inst_executed_pipe_lsu
      • sm__inst_executed_pipe_tex
      • sm__mio_pq_read_cycles_active
      • sm__mio_pq_write_cycles_active
      • sm__mio2rf_writeback_active
      • sm__memory_throughput_internal_activity

    gpu__compute_memory_throughput is the MAX of the following metrics:

    • gpu__compute_memory_access_throughput
      • l1tex__data_bank_reads
      • l1tex__data_bank_writes
      • l1tex__data_pipe_lsu_wavefronts
      • l1tex__data_pipe_tex_wavefronts
      • l1tex__f_wavefronts
      • lts__d_atomic_input_cycles_active
      • lts__d_sectors
      • lts__t_sectors
      • lts__t_tag_requests
      • gpu__compute_memory_access_throughput_internal_activity
    • gpu__compute_memory_request_throughput
      • l1tex__lsuin_requests
      • l1tex__texin_sm2tex_req_cycles_active
      • l1tex__lsu_writeback_active
      • l1tex__tex_writeback_active
      • l1tex__m_l1tex2xbar_req_cycles_active
      • l1tex__m_xbar2l1tex_read_sectors
      • lts__lts2xbar_cycles_active
      • lts__xbar2lts_cycles_active
      • lts__d_sectors_fill_device
      • lts__d_sectors_fill_sysmem
      • gpu__dram_throughput
      • gpu__compute_memory_request_throughput_internal_activity

    In your case the limiter is sm__inst_executed_pipe_lsu, which is an instruction throughput and therefore counts toward SM%, not Memory%. If you review sections/SpeedOfLight.py, latency bound is defined as having both sm__throughput and gpu__compute_memory_throughput below 60%.

    Some instruction pipelines, such as fp64, xu, and lsu, have lower peak throughput than others (this varies by chip). Pipeline utilization is part of sm__throughput. To improve performance, the options are:

    1. Reduce instructions to the oversubscribed pipeline, or
    2. Issue instructions of different type to use empty issue cycles.
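
    As an illustration of option 1, vectorizing loads and stores reduces the number of instructions sent to the LSU pipe. This is a minimal sketch (the kernel names are invented), assuming the element count is a multiple of 4 and the buffers are 16-byte aligned:

    ```cuda
    // Sketch only: assumes n is a multiple of 4 and the pointers
    // are 16-byte aligned.

    // Baseline: one LSU load and one LSU store instruction per float.
    __global__ void copy_scalar(const float* in, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];
    }

    // Vectorized: each load/store moves four floats (16 bytes), so the
    // kernel issues roughly 4x fewer LSU instructions for the same data.
    __global__ void copy_vec4(const float4* in, float4* out, int n4)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n4)
            out[i] = in[i];
    }
    ```

    Profiling both kernels and comparing sm__inst_executed_pipe_lsu should show the reduction.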

    GENERATING THE BREAKDOWN

    As of Nsight Compute 2020.1 there is no simple command-line option to generate the list without running a profiling session. For now you can collect one throughput metric using breakdown:<throughput metric>.avg.pct_of_peak_sustained_elapsed and parse the output to get the sub-metric names.

    For example:

    ncu.exe --csv --metrics breakdown:sm__throughput.avg.pct_of_peak_sustained_elapsed --details-all -c 1 cuda_application.exe
    

    generates:

    "ID","Process ID","Process Name","Host Name","Kernel Name","Kernel Time","Context","Stream","Section Name","Metric Name","Metric Unit","Metric Value"
    "0","33396","cuda_application.exe","127.0.0.1","kernel()","2020-Aug-20 13:26:26","1","7","Command line profiler metrics","gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed","%","0.38"
    "0","33396","cuda_application.exe","127.0.0.1","kernel()","2020-Aug-20 13:26:26","1","7","Command line profiler metrics","l1tex__data_bank_reads.avg.pct_of_peak_sustained_elapsed","%","0.05"
    "0","33396","cuda_application.exe","127.0.0.1","kernel()","2020-Aug-20 13:26:26","1","7","Command line profiler metrics","l1tex__data_bank_writes.avg.pct_of_peak_sustained_elapsed","%","0.05"
    ...
    

    The keyword breakdown can be used in Nsight Compute section files to expand a throughput metric. This is used in the SpeedOfLight.section.
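
    Section files are protobuf text files. As an approximate, untested sketch (the identifier and labels here are invented, and the exact schema may differ between Nsight Compute versions), a breakdown-prefixed metric in a custom section might look like:

    ```
    Identifier: "ThroughputBreakdown"  # hypothetical custom section
    DisplayName: "Throughput Breakdown"
    Header {
      Metrics {
        Label: "Compute (SM) Throughput"
        Name: "breakdown:sm__throughput.avg.pct_of_peak_sustained_elapsed"
      }
    }
    ```

    Compare against the .section files shipped in your Nsight Compute installation's sections directory for the authoritative format.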