On recent Intel CPUs, the `POP` instruction usually has a throughput of 2 instructions per cycle. However, when using register `R12` (or `RSP`, which has the same encoding except for the REX prefix), the throughput drops to 1 per cycle if the instructions go through the legacy decoders (it stays at around 2 per cycle if the µops come from the DSB).

This can be reproduced using nanoBench as follows:

```shell
sudo ./nanoBench.sh -asm "pop R12"
```
Further experiments on a Haswell machine show the following: when adding between 1 and 4 `nop`s,

```shell
sudo ./nanoBench.sh -asm "pop R12; nop;"
sudo ./nanoBench.sh -asm "pop R12; nop; nop;"
sudo ./nanoBench.sh -asm "pop R12; nop; nop; nop;"
sudo ./nanoBench.sh -asm "pop R12; nop; nop; nop; nop;"
```
the execution time increases to 2 cycles. When adding a 5th `nop`,

```shell
sudo ./nanoBench.sh -asm "pop R12; nop; nop; nop; nop; nop;"
```

the execution time increases to 3 cycles. This suggests that no other instruction can be decoded in the same cycle as a `pop R12` instruction. (When using a different register, e.g., `R11`, the last example needs just 1.5 cycles.)
On Skylake, the execution time stays at 1 cycle when adding between 1 and 3 `nop`s, and increases to 2 cycles for between 4 and 7 `nop`s. This suggests that `pop R12` requires the complex decoder, even though it decodes to just one µop (see also *Can the simple decoders in recent Intel microarchitectures handle all 1-µop instructions?*).
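The measured cycle counts are consistent with a simple mental model of the legacy decoders. The following is a toy sketch, not confirmed internals: it assumes 4 instructions per decode group per cycle, that `pop r12` can only be handled by the complex decoder (i.e., must start a group), and that on Haswell it additionally ends its group, so nothing else decodes in the same cycle.

```python
# Toy model of legacy-decode grouping.  Assumptions (hypothetical, chosen to
# match the measurements above): 4 decode slots per cycle; an instruction in
# `complex_only` must start a new decode group; one in `ends_group` lets no
# other instruction decode in the same cycle.
def decode_cycles(insns, complex_only, ends_group=frozenset(), width=4):
    cycles = 0
    slot = width  # force a new group for the first instruction
    for insn in insns:
        if slot == width or insn in complex_only:
            cycles += 1   # start a new decode group (one per cycle)
            slot = 0
        slot += 1
        if insn in ends_group:
            slot = width  # nothing else joins this group
    return cycles

HSW = dict(complex_only={"pop r12"}, ends_group={"pop r12"})
SKL = dict(complex_only={"pop r12"})

# Haswell: pop r12 decodes alone, then nops fill 4-wide groups.
assert decode_cycles(["pop r12"] + ["nop"] * 1, **HSW) == 2
assert decode_cycles(["pop r12"] + ["nop"] * 4, **HSW) == 2
assert decode_cycles(["pop r12"] + ["nop"] * 5, **HSW) == 3

# Skylake: pop r12 needs the complex decoder but doesn't end its group.
assert decode_cycles(["pop r12"] + ["nop"] * 3, **SKL) == 1
assert decode_cycles(["pop r12"] + ["nop"] * 7, **SKL) == 2

# pop r11 has no restriction: 6 instructions take ceil(6/4) = 2 groups.
assert decode_cycles(["pop r11"] + ["nop"] * 5, complex_only=set()) == 2
```

For the unrestricted `pop r11` case, back-to-back loop iterations can pack across group boundaries, which is how 6 instructions per iteration average out to the measured 1.5 cycles.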
Why is the `POP` instruction decoded differently when using register `R12`? Are there any other instructions for which this is also the case?
Workaround: the `pop r/m64` encoding of `pop r12` doesn't have this decode penalty. (Thanks @Andreas for testing my guess.)

```nasm
db 0x41, 0x8f, 0xc4      ; REX.B=1  8F /0  pop r/m64  =  pop r12
```
The standard encoding of `pop r12` has the same opcode byte as `pop rsp`, differing only by a REX prefix. (The short-form encoding puts the register number in the low 3 bits of that single opcode byte.)

`pop rsp` is special-cased even in the decoders; on Haswell it's 3 uops¹, so clearly only the complex decoder can decode it. `pop r12` also getting penalized makes sense if the primary filtering of which decoder can decode which instruction goes by the opcode byte (not accounting for prefixes), at least for this group of opcodes. Whether or not this really reflects the exact internals, it's at least a useful mental model for understanding why the modrm form of `pop` doesn't have this effect. (Although normally you'd only use `pop r/m64` with a memory destination, which would mean multi-uop and thus complex-decoder only.) See *Can the simple decoders in recent Intel microarchitectures handle all 1-µop instructions?* for more about this effect with other opcodes.
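To make the overlap concrete, here is a small Python sketch that builds both encodings from the standard rules (short form: `0x58` plus the low 3 register bits, with REX.B supplying the 4th bit; modrm form: `8F /0`). The helper names are made up for illustration; the byte values follow the Intel opcode map.

```python
# Register numbers: rsp = 4, r12 = 12 (r12 is rsp's high-bank counterpart).
def pop_short(reg):
    # pop r64 short form: opcode byte 0x58 + (reg & 7); a REX.B prefix
    # (0x41) supplies the 4th register bit for r8..r15.
    rex = [0x41] if reg >= 8 else []
    return bytes(rex + [0x58 + (reg & 7)])

def pop_modrm(reg):
    # pop r/m64: opcode 0x8F with /0, register-direct modrm (mod=11, reg=0).
    rex = [0x41] if reg >= 8 else []
    return bytes(rex + [0x8F, 0xC0 | (reg & 7)])

RSP, R12 = 4, 12
assert pop_short(RSP) == bytes([0x5C])                # pop rsp
assert pop_short(R12) == bytes([0x41, 0x5C])          # pop r12
assert pop_short(RSP)[-1] == pop_short(R12)[-1]       # same opcode byte!
assert pop_modrm(R12) == bytes([0x41, 0x8F, 0xC4])    # the workaround encoding
```

A decoder that filters on the opcode byte alone can't tell `pop rsp` from `pop r12`; only the REX prefix distinguishes them.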
`push rsp` is 2 total uops on Haswell, unlike most `push reg` instructions, which are 1 uop. But that extra uop is likely just a stack-sync uop inserted during issue/rename (because the instruction reads RSP), not during decode. @Andreas reports that `push rsp` and `push r12` both show no special effects in the decoders (and, I assume, the uop cache): just 1 micro-fused uop, with or without a stack-sync uop when it executes.
Opcodes like `FF /0` (`inc r/m32`), where the same leading byte is shared between different instructions (overloading the modrm `reg` field as extra opcode bits), might be interesting to check, if there are some single-uop instructions that share a leading byte with multi-uop instructions. For example `C0 /4` (`shl r/m8, imm8`) vs. `C0 /2` (`rcl r/m8, imm8`); see http://ref.x86asm.net/coder64.html. But `shl` with a memory destination can already be multiple uops, so it might get optimistically attempted by the simple decoders anyway, and succeed if it turns out to be single-uop. Perhaps `pop r12`, by contrast, bails out early in the simple decoders without even looking at the REX prefix.
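As a concrete illustration of that `/digit` overloading, here's a purely illustrative Python sketch that selects the instruction in the `C0` shift/rotate group from the modrm byte's reg field (names per ref.x86asm.net):

```python
# For /digit opcode groups, bits 5:3 of the modrm byte act as extra opcode
# bits: one leading byte (here 0xC0) covers eight different instructions.
C0_GROUP = {0: "rol", 1: "ror", 2: "rcl", 3: "rcr",
            4: "shl", 5: "shr", 6: "shl", 7: "sar"}  # /6 is an undocumented alias of /4

def slash_digit(modrm):
    return (modrm >> 3) & 7  # the "reg" field of the modrm byte

# modrm 0xE0 = 11 100 000b: mod=11 (register-direct), reg=4, rm=0 (AL)
assert C0_GROUP[slash_digit(0xE0)] == "shl"  # C0 E0 ib = shl al, imm8
# modrm 0xD0 = 11 010 000b: reg=2
assert C0_GROUP[slash_digit(0xD0)] == "rcl"  # C0 D0 ib = rcl al, imm8
```

So a decoder that routes instructions by leading opcode byte alone would have to treat all eight of these the same way, which is what would make this group worth benchmarking.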
It would make sense for Intel to spend the transistors to make sure common instructions like immediate shifts decode efficiently, more so than for less-common instructions like `pop r12`, which you'll normally find only in function epilogues, and thus usually not in inner loops, only in larger loops that include function calls.
Footnote 1: `pop rsp` is special because it's effectively just `mov rsp, [rsp]`. (Or as the manual puts it, "The POP ESP instruction increments the stack pointer (ESP) before data at the old top of stack is written into the destination.") Haswell's 3-uop implementation seems unnecessary vs. literally the same 1 uop as `mov rsp, [rsp]` (I think the fault conditions are identical), but this might have saved transistors in the decoders by adding a uop to the normal way `pop reg` decodes (possibly implicitly requiring a stack-sync uop, for a total of 3), instead of treating it as a whole separate instruction. `pop rsp` is very rarely used, so its performance doesn't matter.
Perhaps the 16-bit `pop sp` case was a problem for decoding that opcode as 1 pure-load uop? There is no `[sp]` addressing mode in x86 machine code, and it's possible that limitation extends to the internal uops for 16-bit address generation. Other than that, I think the possible fault reasons are the same for `pop` and `mov`.
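The `pop rsp` ≡ `mov rsp, [rsp]` equivalence can be sketched as a toy simulation of the manual's ordering (hypothetical helper names; 64-bit operand size, so the increment is by 8):

```python
# Toy simulation of the manual's semantics: POP increments the stack pointer
# *before* the loaded data is written to the destination.  When the
# destination is rsp itself, the write overwrites the increment, leaving
# exactly the same final state as mov rsp, [rsp].
def pop_rsp(mem, rsp):
    val = mem[rsp]   # load from the old top of stack
    rsp += 8         # increment the stack pointer first...
    rsp = val        # ...then the write to the destination (rsp) wins
    return rsp

def mov_rsp_mem_rsp(mem, rsp):
    return mem[rsp]  # mov rsp, [rsp]

mem = {0x1000: 0x2000}  # a saved stack pointer sitting at the top of stack
assert pop_rsp(mem, 0x1000) == mov_rsp_mem_rsp(mem, 0x1000) == 0x2000
```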
`pop r12` (short form) does eventually decode to the normal 1 uop, with no more stack-sync uops than for repeated pops of other registers, as per @Andreas's testing. It gets penalized by not being decodeable by the simple decoders, but not by any of the extra uops that `pop rsp` specifically decodes to.
Perhaps GAS, NASM, and other assemblers should get a patch to make it possible to encode `pop r12` with the modrm encoding, although probably not as the default: decoder throughput is often not a problem, so spending an extra byte of code size by default would be undesirable, especially if there's no benefit on other uarches, like AMD or Silvermont-family.
And/or GCC should use R12 as its last choice of call-preserved register to save/restore? (R12 always needs a SIB byte when used as the base in an addressing mode, too, so that's another reason to avoid it, if compilers aren't going to try to avoid keeping pointers in it.) And maybe schedule the push/pop of R12 for efficient decoding, with 3 other pops (or other single-uop insns) after it before the multi-uop `ret`.