Tags: assembly, x86, cpu-architecture

What is instruction fusion in contemporary x86 processors?


What I understand is that there are two types of instruction fusion:

  1. Micro-operation fusion
  2. Macro-operation fusion

Micro-operations are those operations that can be executed in 1 clock cycle. If several micro-operations are fused, we obtain an "instruction".

If several instructions are fused, we obtain a Macro-operation.

If several macro-operations are fused, we obtain Macro-operation fusing.

Am I correct?


Solution

  • No, fusion is totally separate from how one complex instruction (like cpuid or lock add [mem], eax) can decode to multiple uops. Most instructions decode to a single uop, so that's the normal case in modern x86 CPUs.
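
    As a rough illustration (a sketch; exact uop counts vary by microarchitecture, and the multi-uop examples are just illustrative):

        add   eax, ecx        ; typical case: one instruction decodes to 1 uop
        add   eax, [rdi]      ; load + ALU add: 2 unfused uops, micro-fused into 1 fused-domain uop
        add   [rdi], eax      ; memory-destination RMW: several uops, but still retires as one instruction
        cpuid                 ; microcoded: many uops, all tracked as belonging to one instruction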

    The back-end has to keep track of all uops associated with an instruction, whether or not there was any micro-fusion or macro-fusion. When all the uops for a single instruction have retired from the ROB, the instruction has retired. (Interrupts can only be taken at instruction boundaries, so if one is pending, retirement has to find an instruction boundary for that, not in the middle of a multi-uop instruction. Otherwise retire slots can be filled without regard to instruction boundaries, like issue slots.)


    Macro-fusion - between instructions

    Macro-fusion decodes cmp/jcc or test/jcc into a single compare-and-branch uop (Intel and AMD CPUs). The rest of the pipeline sees it purely as a single uop (see footnote 1), except that performance counters still count it as 2 instructions. This saves uop cache space, and bandwidth everywhere including decode. In some code, compare-and-branch makes up a significant fraction of the total instruction mix, maybe around 25%, so choosing to look for this fusion rather than other possible fusions like mov dst,src1 / or dst,src2 makes sense.

    Sandybridge-family can also macro-fuse some other ALU instructions with conditional branches, like add/sub or inc/dec + JCC with some conditions. (x86_64 - Assembly - loop conditions and out of order)
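
    A few fusible pairs on Sandybridge-family, as a sketch (the exact set of flag-setting instructions and conditions that can fuse varies by generation; see Agner Fog's tables):

        test  eax, eax
        jz    .skip           ; test/jcc: fuses with any jcc condition

        cmp   rax, rbx
        ja    .above          ; cmp/jcc: fuses with the common carry/zero/signed-compare conditions

        dec   rcx
        jnz   .loop           ; inc/dec: fuses only with conditions that don't read CF (dec leaves CF unchanged)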

    Ice Lake (see footnote 2) changed to doing macro-fusion right after legacy decode, so pre-decode only has to steer 1 x86 instruction to each decoder.


    Micro-fusion - within 1 instruction

    Micro-fusion stores 2 uops from the same instruction together so they only take up 1 "slot" in the fused-domain parts of the pipeline. But they still have to dispatch separately to separate execution units. And in Intel Sandybridge-family, the RS (Reservation Station aka scheduler) is in the unfused domain, so they're even stored separately in the scheduler. (See Footnote 2 in my answer on Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths)

    P6 family had a fused-domain RS, as well as ROB, so micro-fusion helped increase the effective size of the out-of-order window there. But SnB-family reportedly simplified the uop format making it more compact, allowing larger RS sizes that are helpful all the time, not just for micro-fused instructions.

    And Sandybridge family will "un-laminate" indexed addressing modes under some conditions, splitting them back into 2 separate uops in their own slots before issue/rename into the ROB in the out-of-order back end, so you lose the front-end issue/rename throughput benefit of micro-fusion. See Micro fusion and addressing modes
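
    For example (a sketch of the behavior described above and in the linked Q&A):

        add   eax, [rdi]         ; micro-fused: 1 fused-domain uop (load + add), 2 in the unfused domain
        add   eax, [rdi+rcx*4]   ; indexed addressing mode: un-laminates before issue on SnB/IvB (2 fused-domain uops)
                                 ; Haswell and later keep this micro-fused because the destination register is read-modify-write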



    Both can happen at the same time

        cmp   [rdi], eax
        jnz   .target
    

    Tested on i7-6700k Skylake, probably applicable to most earlier and later Sandybridge-family CPUs, especially before Ice Lake.

    The cmp/jcc can macro-fuse into a single cmp-and-branch ALU uop, and the load from [rdi] can micro-fuse with that uop.

    Failure to micro-fuse the cmp does not prevent macro-fusion.

    The limitations here are: RIP-relative + immediate can never micro-fuse, so cmp dword [static_data], 1 / jnz can macro-fuse but not micro-fuse.
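
    In NASM syntax that case looks like this (a sketch; static_data stands for any RIP-relatively addressed label):

        cmp   dword [rel static_data], 1   ; RIP-relative + immediate: can macro-fuse with the jnz, but not micro-fuse
        jnz   .target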

    A cmp/jcc on SnB-family (like cmp [rdi+rax], edx / jnz) will macro- and micro-fuse in the decoders, but the micro-fusion will un-laminate before the issue stage. (So it's 2 total uops in both the fused-domain and unfused-domain: a load with an indexed addressing mode, and an ALU cmp/jnz). You can verify this with perf counters by putting a mov ecx, 1 in between the CMP and JCC vs. after, and note that uops_issued.any:u and uops_executed.thread both go up by 1 per loop iteration when the mov sits between them, because we defeated macro-fusion; micro-fusion (and its un-lamination) behaved the same either way.

    On Skylake, cmp dword [rdi], 0/jnz can't macro-fuse (it only micro-fuses). I tested with a loop that contained some dummy mov ecx,1 instructions. Reordering so one of those mov instructions split up the cmp/jcc didn't change perf counters for fused-domain or unfused-domain uops.

    But cmp [rdi],eax/jnz does macro- and micro-fuse. Reordering so a mov ecx,1 instruction separates CMP from JNZ does change perf counters (proving macro-fusion), and uops_executed is higher than uops_issued by 1 per iteration (proving micro-fusion).

    cmp [rdi+rax], eax/jne only macro-fuses; not micro. (Well actually micro-fuses in decode but un-laminates before issue because of the indexed addressing mode, and it's not an RMW-register destination like sub eax, [rdi+rax] that can keep indexed addressing modes micro-fused. That sub with an indexed addressing mode does macro- and micro-fuse on SKL, and presumably Haswell).

    (The cmp dword [rdi],0 does micro-fuse, though: uops_issued.any:u is 1 lower than uops_executed.thread, and the loop contains no nop or other "eliminated" instructions, or any other memory instructions that could micro-fuse).
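
    A minimal sketch of the kind of test loop behind these counter comparisons (hypothetical code, not the exact loop used; assumes rdi points at data equal to eax so the first jnz falls through, rcx holds the iteration count, and the dummy uses edx so it doesn't clash with the loop counter):

        .loop:
        cmp   [rdi], eax     ; pair under test: macro- and micro-fuses with the jnz below
        jnz   .mismatch      ; not taken with the test data
        mov   edx, 1         ; dummy instruction; moving it up between the cmp and jnz defeats macro-fusion
        dec   rcx
        jnz   .loop          ; loop branch, itself a macro-fused dec/jnz pair
        .mismatch: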

    Some compilers (including GCC IIRC) prefer to use a separate load instruction and then compare+branch on a register. TODO: check whether gcc and clang's choices are optimal with immediate vs. register.
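
    The two code-gen strategies look roughly like this (a sketch; which wins depends on the fusion rules above and on register pressure):

        ; separate load, then register-register compare and branch
        mov   eax, [rdi]
        cmp   eax, edx
        jne   .target

        ; memory-operand compare that can macro- and micro-fuse (non-indexed address, register comparand)
        cmp   [rdi], edx
        jne   .target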


    "Micro-operations are those operations that can be executed in 1 clock cycle."

    Not exactly. They take 1 "slot" in the pipeline, or in the ROB and RS that track them in the out-of-order back-end.

    And yes, dispatching a uop to an execution port happens in 1 clock cycle, and simple uops (e.g. integer addition) can complete execution in the same cycle. Up to 8 uops can dispatch simultaneously per cycle since Haswell (one per execution port), rising to 10 ports on Sunny Cove. The actual execution might take more than 1 clock cycle, occupying the execution unit for longer (e.g. FP division).

    The divider is, I think, the only execution unit on modern mainstream Intel that's not fully pipelined, but Knight's Landing has some not-fully-pipelined SIMD shuffles that are a single uop but have a (reciprocal) throughput of 2 cycles.


    Footnote 1 - does a macro-fused uop ever need to split?

    If cmp [rdi], eax / jne faults on the memory operand, i.e. a #PF page fault exception, it's taken with the exception return address pointing to the start of the cmp, so it can re-run after the OS pages in the page. That Just Works whether we have fusion or not, nothing surprising.

    Or if the branch target address is an unmapped page, a #PF exception will happen after the branch has already executed, from code fetch with an updated RIP.

    But if the branch target address is non-canonical, architecturally the jcc itself should #GP fault, e.g. if RIP was near the top of the canonical range and rel32 = almost +2GiB. (x86-64 is designed so RIP values can literally be 48-bit or 57-bit internally, never needing to hold a non-canonical address, since a fault happens on trying to set it, not waiting until code-fetch from the non-canonical address.)

    If CPUs handle that with an exception on the jcc, not the cmp, then sorting that out can be deferred until the exception is actually detected. Maybe with a microcode assist, or some special-case hardware.

    Also, single-stepping with TF=1 should stop after the cmp.

    As far as how the cmp/jcc uop goes through the pipeline in the normal case, it works exactly like one long single-uop instruction that both sets flags and conditionally branches.

    Surprisingly, the loop instruction (like dec rcx/jnz but without setting flags) is not a single uop on Intel CPUs. See Why is the loop instruction slow? Couldn't Intel have implemented it efficiently?
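
    For comparison (a sketch):

        ; preferred: macro-fuses into one dec-and-branch uop on SnB-family (but writes flags)
        .top:
        ; ... loop body ...
        dec   rcx
        jnz   .top

        ; same control flow without touching flags, but loop decodes to multiple uops on Intel (slow)
        .top2:
        ; ... loop body ...
        loop  .top2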


    Footnote 2: Ice Lake changes

    Agner Fog found that macro-fusion happens after the legacy decoders. (Micro-fusion of course still happens in the decoders, so instructions like add eax, [rdi] can still decode in a "simple" decoder.)

    Hopefully the upside here is not ending a decode group early if the last instruction is one that could maybe macro-fuse, which is IIRC something earlier CPUs do. (Lower legacy-decode throughput for a big unrolled block of sub instructions vs. or instructions when no JCC is involved. Earlier CPUs couldn't macro-fuse or with anything. This only affected legacy decode, not the uop cache.)

    Wikichip incorrectly reports that ICL can only make one macro-fusion per clock cycle, but testing (see Can two fuseable pairs be decoded in the same clock cycle?) confirms that Rocket Lake (the same uarch backported to 14nm) can still do 2/clock like Haswell and Skylake.

    One source reports that Ice Lake can't macro-fuse inc or dec/jcc (or any instruction with a memory operand), but Agner Fog's table disagrees. uiCA shows dec/jnz at the bottom of a loop macro-fusing, and their paper shows its predictions agree well with testing on real CPUs including ICL. But if they compiled with recent GCC, they might not have tested any dec/jcc loops, only sub/jcc. Agner's ICL fusion table isn't a copy/paste of his earlier SnB tables; it shows inc/dec can now fuse in the same cases as add/sub (which surprisingly now includes jc/ja, even though dec doesn't modify CF). If anyone could test this to verify, that'd be great.

    Update: Noah's testing on a Tiger Lake shows that dec/jnz at the bottom of a loop can macro-fuse, and that dec/jc doesn't appear to macro-fuse.

    Microcode version: 0x42. decl; jnz loop still macro-fuses (niters = nissued_uops = nexecuted_uops = cycles = {expected_ports}).

    Couldn't get decl; jc to macro-fuse. For decl; jc I set up two loops: subl $1, %ecx; decl %eax; jc loop (where ecx was a loop counter). niters * 3 uops issued/executed.
    Also tried with just the carry flag unset and decl %eax; jc done; jnz loop: also 3 * niters uops.

    It's likely that Ice Lake behaves the same as Tiger Lake here, since Tiger Lake didn't make major changes to the core microarchitecture relative to Ice Lake.