Originally I was trying to reproduce the effect described in Agner Fog's microarchitecture guide section "Warm-up period for YMM and ZMM vector instructions" where it says that:
The processor turns off the upper parts of the vector execution units when it is not used, in order to save power. Instructions with 256-bit vectors have a throughput that is approximately 4.5 times slower than normal during an initial warm-up period of approximately 56,000 clock cycles or 14 μs.
I did get the slowdown, although on my machine it seems to be closer to ~2x than to 4.5x. But what I've found is that on my CPU (Intel i7-9750H, Coffee Lake) the slowdown affects not only 256-bit operations, but also 128-bit vector ops and scalar floating-point ops (and even some number of GPR-only instructions immediately following an XMM-touching instruction).
Code of the benchmark program:
# Compile and run:
# clang++ ymm-throttle.S && ./a.out
.intel_syntax noprefix
.data
L_F0:
.asciz "ref cycles = %u\n"
.p2align 5
L_C0:
.long 1
.long 2
.long 3
.long 4
.long 1
.long 2
.long 3
.long 4
.text
.set initial_scalar_warmup, 5*1000*1000
.set iteration_count, 30*1000
.set wait_count, 50*1000
.global _main
_main:
# ---------- Initial warm-up
# It seems that we enter _main (at least on macOS 11.2.2) in a "ymm warmed-up" state.
#
# The initial warm-up loop below is long enough for the processor to switch back to
# the "ymm cold" state. It may also reduce measurement noise from dynamic frequency
# scaling (hopefully the CPU is at full boost by the time the initial warm-up loop finishes).
vzeroupper
push rbp
mov ecx, initial_scalar_warmup
.p2align 4
_initial_loop:
add eax, 1
add edi, 1
add edx, 1
dec ecx
jnz _initial_loop
# --------- Measure XMM
# TOUCH YMM.
# Test to see the effect of touching an unrelated YMM register
# on XMM performance.
# If the "vpxor ymm9" below is commented out, then _xmm_loop below
# runs a lot faster (~2x faster).
vpxor ymm9, ymm9, ymm9
mov ecx, iteration_count
rdtsc
mov esi, eax
vpxor xmm0, xmm0, xmm0
vpxor xmm1, xmm1, xmm1
vpxor xmm2, xmm2, xmm2
vmovdqa xmm3, [rip + L_C0]
.p2align 5
_xmm_loop:
# Here we only do a 128-bit (XMM) VEX-encoded op, but it still triggers execution throttling.
vpaddd xmm0, xmm3, xmm3
add edi, 1
add eax, 1
dec ecx
jnz _xmm_loop
lfence
rdtsc
sub eax, esi
mov esi, eax # ESI = ref cycles count
# ------------- Print results
lea rdi, [rip + L_F0]
xor eax, eax
call _printf
vzeroupper
xor eax, eax
pop rbp
ret
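For the scalar-FP and GPR->XMM cases I mention, the same skeleton can be reused by swapping only the first instruction of _xmm_loop; a sketch of the substitutions (not the exact code of each run):
_xmm_loop:
# vpaddd xmm0, xmm3, xmm3 # original 128-bit vector op
vmovq xmm0, rax # GPR -> XMM transfer; also appears to be throttled
# vaddsd xmm0, xmm1, xmm2 # scalar double-precision FP op; also appears to be throttled
add edi, 1
add eax, 1
dec ecx
jnz _xmm_loop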
Question: Is my benchmark correct? Does the description (below) of what's happening seem plausible?
The CPU is in an AVX-cold state (no 256-bit/512-bit instruction has been executed for ~675 µs) and encounters a single instruction with a YMM (or ZMM) destination register. The CPU immediately switches to some sort of "transition to AVX-warm" state. This switch presumably takes the ~100-200 cycles mentioned in Agner's guide, and the "transition" period itself lasts ~56,000 cycles.
During the transition period GPR code may execute normally, but any instruction that has a vector destination register (including 128-bit XMM or scalar floating-point instructions, even vmovq xmm0, rax) applies throttling to the entire execution pipeline. This affects GPR-only code immediately following such an instruction for N cycles (not sure how many; perhaps a dozen cycles' worth of instructions).
Perhaps throttling limits the number of µops dispatched to execution units (regardless of what those µops are, as long as at least one µop has a vector destination register)?
What's new here for me is that I thought throttling during the transition period would apply only to 256-bit (and 512-bit) instructions, but it seems like any instruction with a vector register destination is affected (as well as the ~20-60 GPR-only instructions immediately following it; I can't measure this more precisely on my system).
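To put a rough bound on that number, a variation I can sketch (not the exact code behind the ~20-60 estimate) is to follow the single XMM-touching instruction with an unrolled run of plain adds and grow the run until the per-iteration cost gets close to the pure-GPR baseline:
_probe_loop:
vpaddd xmm0, xmm3, xmm3 # one vector-destination instruction per iteration
.rept 64 # unrolled GPR-only run; try e.g. 8, 16, 32, 64, 128
add eax, 1
.endr
dec ecx
jnz _probe_loop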
Related: the "Voltage Only Transitions" section of an article on Travis Downs' blog may be describing the same effect. Although the author measured the performance of YMM vectors during the transition period, the conclusion was that it is not just the upper half of the vector unit that is throttled; rather, throttling is applied to the entire pipeline whenever an instruction touching a vector register is encountered during the transition period. (Edit: the blog post did not measure XMM registers during the transition period, which is what this post measures.)
The fact that you see throttling even for narrow SIMD instructions is a side-effect of a behavior I call implicit widening.
Basically, on modern Intel, if the upper 128-255 bits are dirty on any register in the range ymm0 to ymm15, any SIMD instruction is internally widened to 256 bits, since the upper bits need to be zeroed and this requires the full 256-bit registers in the register file to be powered and probably the 256-bit ALU path as well. So the instruction acts, for the purposes of AVX frequencies, as if it was 256 bits wide.
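In terms of the asm in the question, a minimal illustration of the dirty/clean distinction would be something like the following sketch (vmovdqa ymm9, [rip + L_C0] is just an arbitrary 256-bit write used to make the upper half dirty):
vmovdqa ymm9, [rip + L_C0] # 256-bit write: the upper half of ymm9 is now dirty
vpaddd xmm0, xmm3, xmm3 # 128-bit op, but internally widened to 256 bits
vzeroupper # uppers of all ymm registers are clean again
vpaddd xmm0, xmm3, xmm3 # now executes as a genuine 128-bit op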
Similarly, if bits 256 to 511 are dirty on any zmm register in the range zmm0 to zmm15, operations are implicitly widened to 512 bits.
For the purposes of light vs heavy instructions, the widened instructions have the same type as they would if they were full width. That is, a 128-bit FMA which gets widened to 512 bits acts as "heavy AVX-512" even though only 128 bits of FMA is occurring.
This applies to all instructions which use the xmm/ymm registers, even scalar FP operations.
Note that this doesn't just apply to this throttling period: it means that if you have dirty uppers, a narrow SIMD instruction (or scalar FP) will cause a transition to the more conservative DVFS states just as a full-width instruction would do.
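So the usual mitigation applies here too: once a stretch of wide work is done, clear the uppers before dropping back to scalar or 128-bit code (compilers typically emit vzeroupper at function boundaries for exactly this reason). A sketch of the pattern:
vpaddd ymm0, ymm1, ymm2 # 256-bit work: uppers are dirty from here on
# ... more 256-bit code ...
vzeroupper # mark all uppers clean once the wide section ends
vaddsd xmm0, xmm1, xmm2 # scalar FP now stays narrow for frequency/DVFS purposes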