cuda, gpu, nvidia, nsight-compute

When does MIO Throttle stall happen?


According to this link https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html:

Warp was stalled waiting for the MIO (memory input/output) instruction queue to be not full. This stall reason is high in cases of extreme utilization of the MIO pipelines, which include special math instructions, dynamic branches, as well as shared memory instructions.

And according to this one https://docs.nvidia.com/drive/drive_os_5.1.12.0L/nsight-graphics/activities/index.html:

May be triggered by local, global, shared, attribute, IPA, indexed constant loads (LDC), and decoupled math.

My understanding is that all memory operations are executed on LSUs, so I would imagine that they are stored in the same instruction queue and then executed by the LSU. Since they are all queued together, the second interpretation (which includes global memory accesses) makes more sense to me. The problem is that, if that's the case, LG Throttle would be unnecessary.

What does MIO Throttle actually imply? Are all memory instructions stored on the same queue?


Solution

  • The MIO is a partition in the NVIDIA SM (starting in Maxwell) that contains execution units shared between the 4 warp schedulers, as well as slower math execution units (e.g. the XU pipe).

    Instructions issued to these execution units are first issued into instruction queues, which allows the warp schedulers to continue issuing independent instructions from the warp. If a warp's next instruction targets an instruction queue that is full, the warp is stalled until the queue is no longer full and the instruction can be enqueued. When this stall occurs, the warp reports a throttle reason based on the instruction queue type. The mapping of instruction queues to pipes differs between chips; the general mapping is:

    • mio_throttle (ADU, CBU, LSU, XU)
    • lg_throttle (LSU)
      • lg_throttle is used if the MIO instruction queue reaches a watermark for local/global instructions. Throttling local/global instructions early allows the SM to continue issuing shared memory instructions when there is L1 backpressure due to local/global L1 misses.
    • tex_throttle (TEX, FP64 on non-*100 chips, Tensor on TU11x)

    If the warp's next instruction to issue is to a sub-partition-specific execution unit (FMA, ALU, Tensor, FP64 on *100 GPUs), then the stall reason is math_throttle.
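
The mapping described above can be sketched as a small host-side C++ toy model. This is only an illustration of the decision logic, not a description of the hardware: the type names (Target, Stall, SmState), the function stall_reason, the queue depths, the watermark value, and the math_pipe_busy flag are all hypothetical, and it assumes that a local/global instruction reports lg_throttle once the MIO queue occupancy reaches the watermark.

```cpp
#include <cstddef>

// Which instruction queue or pipe the warp's next instruction targets.
// The grouping follows the "general mapping" in the answer; real chips differ.
enum class Target {
    MioOther,        // ADU, CBU, XU, or LSU shared-memory instructions
    MioLocalGlobal,  // LSU local/global memory instructions
    Tex,             // TEX queue (also FP64 or Tensor on some chips)
    Math             // sub-partition pipes such as FMA and ALU
};

enum class Stall { None, MioThrottle, LgThrottle, TexThrottle, MathThrottle };

// Hypothetical snapshot of SM state. All fields and values are assumptions
// made for illustration; the real queue depths are not taken from the answer.
struct SmState {
    std::size_t mio_used;      // entries currently in the MIO queue
    std::size_t mio_capacity;  // total MIO queue depth
    std::size_t lg_watermark;  // occupancy at which local/global is throttled
    std::size_t tex_used;      // entries currently in the TEX queue
    std::size_t tex_capacity;  // total TEX queue depth
    bool math_pipe_busy;       // target math pipe cannot accept an instruction
};

// Returns the throttle reason the warp would report in this toy model.
constexpr Stall stall_reason(Target next, const SmState& s) {
    switch (next) {
    case Target::MioLocalGlobal:
        // Throttled early, at the watermark, so that queue slots stay
        // available for shared-memory instructions.
        return s.mio_used >= s.lg_watermark ? Stall::LgThrottle : Stall::None;
    case Target::MioOther:
        return s.mio_used >= s.mio_capacity ? Stall::MioThrottle : Stall::None;
    case Target::Tex:
        return s.tex_used >= s.tex_capacity ? Stall::TexThrottle : Stall::None;
    case Target::Math:
        return s.math_pipe_busy ? Stall::MathThrottle : Stall::None;
    }
    return Stall::None;
}

// Made-up depths: 6 of 8 MIO entries used, watermark at 6.
// A local/global load is throttled while a shared-memory load still issues.
static_assert(stall_reason(Target::MioLocalGlobal, SmState{6, 8, 6, 0, 4, false})
                  == Stall::LgThrottle, "local/global throttled at watermark");
static_assert(stall_reason(Target::MioOther, SmState{6, 8, 6, 0, 4, false})
                  == Stall::None, "shared-memory instruction can still enqueue");
```

In this model both local/global and shared-memory instructions go through the same MIO queue, which matches the intuition in the question; lg_throttle exists only because local/global instructions are cut off before the queue is completely full.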