Tags: x86-64, intel, cpu-architecture, micro-optimization, addressing-mode

Bottleneck when using indexed addressing modes


I performed the following experiments both on a Haswell and a Coffee Lake machine.

The instruction

cmp rbx, qword ptr [r14+rax]

has a throughput of 0.5 (i.e., 2 instructions per cycle). This is as expected. The instruction is decoded to one µop that is later unlaminated (see https://stackoverflow.com/a/31027695/10461973) and, thus, requires two retire slots.
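Throughput numbers of this kind are typically measured with a long run of back-to-back copies of the instruction inside a loop. A minimal Intel-syntax sketch (the iteration count, unroll factor, and register setup are illustrative, not from the question):

        mov ecx, 100000000                    ; iteration count (illustrative)
    .loop:
        times 8 cmp rbx, qword ptr [r14+rax]  ; 8 independent copies per iteration
        dec ecx
        jnz .loop
        ; cycles / (8 * iterations), e.g. from perf stat, gives the
        ; per-instruction throughput; ~0.5c here on HSW/CFL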

If we add a nop instruction

cmp rbx, qword ptr [r14+rax]; nop

I would expect a throughput of 0.75, as this sequence requires 3 retire slots, and there also seem to be no other bottlenecks in the back-end. This is also the throughput that IACA reports. However, the actual throughput is 1 (this is independent of whether the µops come from the decoders or the DSB). What is the bottleneck in this case?

Without the indexed addressing mode,

cmp rbx, qword ptr [r14]; nop

has a throughput of 0.5, as expected.


Solution

  • It seems you've uncovered a downside to unlamination vs. regular multi-uop instructions, perhaps in the interaction with 4-wide issue/rename/allocate when a micro-fused uop reaches the head of the IDQ.

    Hypothesis: maybe both uops resulting from un-lamination have to be part of the same issue group, so an unlaminating cmp; nop pair repeated only achieves a front-end throughput of 3 fused-domain uops per clock.

    That might make sense if un-lamination only happens at the head of the IDQ, as uops reach the alloc/rename stage, rather than as they're added to the IDQ. To test this, we could check whether LSD (loop buffer) capacity on Haswell depends on the uop count before or after unlamination - @AndreasAbel's testing shows that a loop containing 55x cmp rbx, [r14+rax] runs from the LSD on Haswell, so that's strong evidence that unlamination happens during alloc/rename, not taking multiple entries in the IDQ itself.
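    A sketch of that LSD-capacity test (assumed loop structure; the 55x figure is from @AndreasAbel's experiment): assuming dec/jnz macro-fuses, the loop is 56 fused-domain uops before unlamination, which just fits Haswell's 56-entry IDQ in single-thread mode, but 111 uops after.

        .loop:
            times 55 cmp rbx, qword ptr [r14+rax]  ; 55 uops before unlamination, 110 after
            dec ebp
            jnz .loop                              ; dec/jnz macro-fuse into 1 uop
            ; if the lsd.uops perf counter shows this running from the LSD,
            ; the loop buffer must hold pre-unlamination (fused-domain) uops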


    For comparison, cmp dword [rip+rel32], 1 won't micro-fuse in the first place, in the decoders, so it won't un-laminate. If it achieves 0.75c throughput, that would be evidence in support of un-lamination requiring room in the same issue group.
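    That control experiment might look like the following sketch (buf is a hypothetical data label; the point is only that a RIP-relative address plus an immediate prevents micro-fusion in the decoders):

        .loop:
            %rep 8
                cmp dword ptr [rel buf], 1  ; 2 separate uops from the decoders;
                nop                         ; never micro-fused, so never unlaminated
            %endrep
            dec ecx
            jnz .loop
            ; each cmp/nop pair is 3 fused-domain uops; 0.75c per pair would
            ; support the same-issue-group hypothesis for unlamination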

    Perhaps times 2 nop; unlaminate or times 3 nop could also be an interesting test to see if the unlaminated uop ever issues by itself or can reliably grab 2 more NOPs after it's delayed from whatever position in an issue group. From your back-to-back cmp-unlaminate test, I expect we'd still see mostly full 4-uop issue groups.
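    Concretely, using the question's cmp as the unlaminating instruction, such a test could be sketched as:

        .loop:
            cmp rbx, qword ptr [r14+rax]  ; unlaminates to 2 uops at alloc/rename
            times 2 nop                   ; can both halves plus these 2 NOPs share one 4-wide group?
            dec ecx
            jnz .loop
            ; compare cycles against uops_issued.any to infer how full the
            ; issue groups actually are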


    Your question mentions retirement but not issue.

    Retire is at least as wide as issue (4-wide from Core2 to Skylake, 5-wide in Ice Lake).

    Sandybridge / Haswell retire 4 fused-domain uops/clock. Skylake can retire 4 fused-domain uops per clock per hyperthread, allowing quicker release of resources like load buffers after one old stalled uop finally completes, if both logical cores are busy. It's not 100% clear whether it can retire 8/clock when running in single-thread mode, I found conflicting claims, and no clear statement in Intel's optimization manual.

    It's very hard, if not impossible, to actually create a bottleneck on retirement (but not issue). Any sustained stream has to get through the issue stage, which is not wider than retirement. (Performance counters for uops_issued.any indicate that un-lamination happens at some point before issue, so that doesn't help us jam more uops through the front-end than retirement can handle. Unless that's misleading: running the same loop on both logical cores of the same physical core should have the same overall bottleneck, but if Skylake runs it faster, that would tell us that parallel SMT retirement helped. Unlikely, but something to check if anyone wants to rule it out.)


    This is also the throughput that IACA reports

    IACA's pipeline model seems pretty naive; I don't think it knows about Sandybridge's multiple-of-4-uop issue effect (e.g. a 6 uop loop costs the same as 8). IACA also doesn't know that Haswell can keep add eax, [rdi+rdx] micro-fused throughout the pipeline, so any analysis of indexed uops that don't un-laminate is wrong.
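    The multiple-of-4 effect mentioned here: on Sandybridge-family, an issue group doesn't continue past a taken branch, so a 6-uop loop body issues as 4 + 2 and still takes 2 cycles per iteration, the same as an 8-uop body. A sketch (counting dec/jnz as 1 macro-fused uop):

        .loop6:
            times 5 nop       ; 5 nops + macro-fused dec/jnz = 6 fused-domain uops
            dec ecx
            jnz .loop6        ; issues as 4 + 2: ~2 cycles/iteration, same as 8 uops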

    I wouldn't trust IACA to do more than count uops and make some wild guesses about how they will allocate to ports.