How do move elimination slots work in Intel CPUs?

Andreas Abel and Jan Reineke discuss move elimination in their paper describing uiCA:

4.1.4 Move Elimination. [...] However, this move elimination is not always successful. [...] We have developed microbenchmarks that use these counters to analyze when move elimination is successful. [...]
The following model agrees with our observations. The processor keeps track of the physical registers that are used by more than one architectural register. We say that each such physical register occupies one elimination slot. An elimination slot is released again after the corresponding registers have been overwritten. The number of move instructions that can be eliminated in a cycle depends both on the number of available elimination slots, and on the number of successful eliminations in the previous cycle.

Where I've added emphasis on the part I don't understand.

I thought that a given physical register could be used from rename to retire only by a single architectural register. I took the meaning of the text to imply otherwise and so I'm struggling to understand how move elimination slots work (and at this point even how register renaming actually works).

Solution

The whole point of mov-elimination is that instead of allocating a new PRF entry (physical register file) and running a uop to read the value and write it to that new entry (like lea rdx, [rcx+0] would on CPUs before Alder Lake P-cores aka Golden Cove), mov rdx, rcx can be handled by having the RAT entry (register allocation table) for RDX point to the same physical register number as RCX does at that point.
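As a rough illustration of that difference, here is a toy Python model of a RAT. The class name ToyRenamer, its method names, and the register counts are all invented for this sketch; it ignores freeing old PRF entries, retirement, and everything else real hardware has to do.

```python
class ToyRenamer:
    """Toy register renamer: maps architectural names to physical register numbers."""

    def __init__(self, arch_regs, num_phys):
        # Initially every architectural register has its own physical register.
        self.rat = {reg: i for i, reg in enumerate(arch_regs)}
        self.free_list = list(range(len(arch_regs), num_phys))

    def rename_write(self, dst):
        """Ordinary uop that writes dst: allocate a fresh PRF entry for the result."""
        new_phys = self.free_list.pop(0)
        self.rat[dst] = new_phys
        return new_phys

    def rename_mov_eliminated(self, dst, src):
        """Eliminated mov: no new PRF entry, dst simply aliases src's physical register."""
        self.rat[dst] = self.rat[src]
        return self.rat[dst]


r = ToyRenamer(["rax", "rcx", "rdx"], num_phys=8)
lea_phys = r.rename_write("rdx")                  # like lea rdx, [rcx+0]: new entry
mov_phys = r.rename_mov_eliminated("rdx", "rcx")  # like mov rdx, rcx: shared entry
```

After the last line, the RAT entries for rdx and rcx hold the same physical register number, which is exactly the situation a plain one-register-per-entry scheme never produces.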

So the whole idea is to bend the rule of a PRF entry being the state of a single architectural register at some point. This presumably makes it more complicated to track when a PRF entry can be freed, or for renaming later uops when two architectural registers both refer to the same physical reg, or some other complication.

"Move-elimination slots" are a separate resource, not PRF entries. They exist to solve whatever extra tracking problem Intel ran into. A move-elimination slot is freed when you overwrite the destination of the mov again later, e.g. mov ecx, edx / not ecx immediately releases whatever mov-elimination resources were needed.

Without mov-elimination, you're right about how it works; one PRF entry holds the value written to only one architectural register, and is an input dependency for any uops that read that register before it's overwritten.

Except that a PRF entry also has room for FLAGS condition codes, so after an instruction like add eax, ecx that writes both FLAGS and an integer reg, both RFLAGS and RAX point to the same physical reg. A later instruction like mov-immediate, not, or lea can overwrite the GP register and leave just CF and the SPAZO group of FLAGS pointing to the old physical reg. Instructions like cmp, stc, or add [mem], eax write (part of) FLAGS but not an integer register.

But that's just two things (the separately-renamed parts of FLAGS, CF and SF/PF/AF/ZF/OF aka SPAZO) which can maybe still refer to a phys reg, other than a GP-integer register. With maybe 1 bit per phys reg to track whether it's still referenced by a GP-integer reg, retirement can free them correctly when retiring a uop that writes a GP-integer register, with maybe just a check against the retirement state of the RAT entries for FLAGS. Or maybe each PRF-entry has 3 bits, one each for GP-integer, CF, and SPAZO, as a way for retirement to figure out when it can free a physical register (when it retires a uop that overwrites the last architectural reference to it.)
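The 3-bits-per-entry guess can be sketched as a toy Python model. Everything here is hypothetical (the FlagAwarePRF class, the bit values, freeing at rename time instead of at retirement); it only shows how one bit each for GP-integer, CF, and SPAZO would be enough to tell when the last architectural reference to an entry is gone.

```python
GP, CF, SPAZO = 1, 2, 4  # hypothetical "still referenced by" bits per PRF entry


def kind(name):
    """Map a renamed name to its reference bit; anything that isn't a flag group is a GP reg."""
    return {"CF": CF, "SPAZO": SPAZO}.get(name, GP)


class FlagAwarePRF:
    def __init__(self):
        self.rat = {}    # architectural name -> physical register number
        self.refs = {}   # physical register number -> bitmask of live references
        self.freed = []  # physical registers released so far
        self.next_phys = 0

    def write(self, names):
        """Rename one uop whose results (GP reg and/or flag groups) share one new PRF entry."""
        phys = self.next_phys
        self.next_phys += 1
        self.refs[phys] = 0
        for name in names:
            old = self.rat.get(name)
            if old is not None:
                self.refs[old] &= ~kind(name)  # old entry loses this reference
                if self.refs[old] == 0:        # last reference gone: entry can be freed
                    del self.refs[old]
                    self.freed.append(old)
            self.rat[name] = phys
            self.refs[phys] |= kind(name)
        return phys


prf = FlagAwarePRF()
p_add = prf.write(["rax", "CF", "SPAZO"])  # add eax, ecx: integer result plus all flags
prf.write(["rax"])                         # not eax: overwrites only the GP register
prf.write(["CF"])                          # stc: overwrites only CF
still_live = p_add in prf.refs             # SPAZO still points at the add's entry
prf.write(["CF", "SPAZO"])                 # cmp: overwrites the remaining flag references
```

In this model the entry written by add stays allocated until cmp overwrites SPAZO, even though RAX and CF moved on earlier.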

BeeOnRope suggests that instead of full reference-counting in every PRF entry (with counters that could count up to 15 in case of mov ecx, eax / mov edx, eax / ...), the move-elimination slots effectively are reference counts.
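A minimal Python sketch of that idea, under loud assumptions: the SlotModel class, the slot count of 1, and the policy of releasing a slot as soon as only one architectural register still refers to the physical register are all choices made for illustration, not documented Intel behaviour. It also ignores the per-cycle limits from the paper's model.

```python
class SlotModel:
    """Toy model: a small table of slots acts as reference counts for shared PRF entries."""

    def __init__(self, arch_regs, num_slots):
        self.rat = {reg: i for i, reg in enumerate(arch_regs)}
        self.next_phys = len(arch_regs)
        self.num_slots = num_slots  # hypothetical capacity
        self.slots = {}             # shared phys reg -> number of arch regs referencing it

    def _drop_ref(self, phys):
        # Only shared registers are tracked; release the slot once sharing has ended.
        if phys in self.slots:
            self.slots[phys] -= 1
            if self.slots[phys] <= 1:
                del self.slots[phys]

    def write(self, dst):
        """Ordinary uop writing dst: new physical register, old reference dropped."""
        old = self.rat[dst]
        self.rat[dst] = self.next_phys
        self.next_phys += 1
        self._drop_ref(old)

    def mov(self, dst, src):
        """Try to eliminate mov dst, src. Returns True if it was eliminated."""
        phys = self.rat[src]
        if self.rat[dst] == phys:
            return True  # already aliased, nothing to track
        if phys not in self.slots and len(self.slots) >= self.num_slots:
            self.write(dst)  # no slot available: behaves like a normal uop
            return False
        self._drop_ref(self.rat[dst])
        self.slots[phys] = self.slots.get(phys, 1) + 1
        self.rat[dst] = phys
        return True


m = SlotModel(["eax", "ecx", "edx", "ebx"], num_slots=1)
first = m.mov("ecx", "eax")   # eliminated: takes the only slot
second = m.mov("edx", "eax")  # eliminated: same physical reg, same slot, count goes up
third = m.mov("eax", "ebx")   # different source, no free slot: not eliminated
m.write("ecx")                # like not ecx: the last extra reference goes away
fourth = m.mov("ecx", "ebx")  # the slot was released, so this one is eliminated
```

The point of the sketch is that several copies of the same register share one slot, while a copy of a different register needs a slot of its own and fails if none is free.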

xor-zeroing can always be eliminated because the physical zero-register never needs to be freed, so it doesn't need to be reference counted. (The existence of a physical zero-register for integer and vector is inferred from the fact that SnB-family is able to eliminate zeroing idioms as no uops.)

Related: Can x86's MOV really be "free"? Why can't I reproduce this at all? That Q&A mentions some of what Intel's optimization manual says about preferring to overwrite the result of a register copy soon, to increase the success rate of mov-elimination. But Intel, at least at that time, didn't mention the details of what CPU resource limit was involved.

Skylake apparently has more mov-elimination slots than Ivy Bridge: in my testing it doesn't run into a bottleneck in the test case Intel used to illustrate the benefit of overwriting the mov result promptly.

It's really unfortunate that Intel screwed up Ice Lake / Tiger Lake and had to disable its mov-elimination (for GP-integer) with a microcode update. Overwriting the mov result right away usually means the mov is part of the critical path latency, which is the opposite of what you want if your code might run on a CPU without mov-elimination. It's working again in Alder Lake and Rocket Lake.

In many cases you will overwrite both the copy and the original soonish, so it's fine to leave the destination unmodified across a few instructions. Ideally avoid leaving the copy unmodified long-term, unless it would cost more uops or make the critical path latencies worse on Ice Lake. (e.g. if you save a copy and only ever read it.) The next interrupt will usually lead to all regs getting saved/restored anyway so this isn't a problem that can "build up" even for code that has a few long-running loops with many mov-eliminated copies.
