On recent Intel CPUs, the `POP` instruction usually has a throughput of 2 instructions per cycle. However, when using register `R12` (or `RSP`, which has the same encoding except for the REX prefix), the throughput drops to 1 per cycle if the instructions go through the legacy decoders (it stays at around 2 per cycle if the µops come from the DSB).

This can be reproduced using nanoBench as follows:

```shell
sudo ./nanoBench.sh -asm "pop R12"
```
Further experiments on a Haswell machine show the following: when adding between 1 and 4 `nop`s,

```shell
sudo ./nanoBench.sh -asm "pop R12; nop;"
sudo ./nanoBench.sh -asm "pop R12; nop; nop;"
sudo ./nanoBench.sh -asm "pop R12; nop; nop; nop;"
sudo ./nanoBench.sh -asm "pop R12; nop; nop; nop; nop;"
```
the execution time increases to 2 cycles. When adding a 5th `nop`,

```shell
sudo ./nanoBench.sh -asm "pop R12; nop; nop; nop; nop; nop;"
```

the execution time increases to 3 cycles. This suggests that no other instruction can be decoded in the same cycle as a `pop R12` instruction. (When using a different register, e.g., `R11`, the last example needs just 1.5 cycles.)
On Skylake, the execution time stays at 1 cycle when adding between 1 and 3 `nop`s, and increases to 2 cycles for between 4 and 7 `nop`s. This suggests that `pop R12` requires the complex decoder, even though it decodes to just one µop (see also *Can the simple decoders in recent Intel microarchitectures handle all 1-µop instructions?*).
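The measured cycle counts are consistent with a simple mental model of the legacy decoders. The following is a toy sketch, not confirmed internals: it assumes 4 instructions per decode group per cycle, that `pop r12` can only be handled by the complex decoder (i.e., must start a group), and that on Haswell it additionally ends its group, so nothing else decodes in the same cycle.

```python
# Toy model of legacy-decode grouping.  Assumptions (hypothetical, chosen to
# match the measurements above): 4 decode slots per cycle; an instruction in
# `complex_only` must start a new decode group; one in `ends_group` lets no
# other instruction decode in the same cycle.
def decode_cycles(insns, complex_only, ends_group=frozenset(), width=4):
    cycles = 0
    slot = width  # force a new group for the first instruction
    for insn in insns:
        if slot == width or insn in complex_only:
            cycles += 1   # start a new decode group (one per cycle)
            slot = 0
        slot += 1
        if insn in ends_group:
            slot = width  # nothing else joins this group
    return cycles

HSW = dict(complex_only={"pop r12"}, ends_group={"pop r12"})
SKL = dict(complex_only={"pop r12"})

# Haswell: pop r12 decodes alone, then nops fill 4-wide groups.
assert decode_cycles(["pop r12"] + ["nop"] * 1, **HSW) == 2
assert decode_cycles(["pop r12"] + ["nop"] * 4, **HSW) == 2
assert decode_cycles(["pop r12"] + ["nop"] * 5, **HSW) == 3

# Skylake: pop r12 needs the complex decoder but doesn't end its group.
assert decode_cycles(["pop r12"] + ["nop"] * 3, **SKL) == 1
assert decode_cycles(["pop r12"] + ["nop"] * 7, **SKL) == 2

# pop r11 has no restriction: 6 instructions take ceil(6/4) = 2 groups.
assert decode_cycles(["pop r11"] + ["nop"] * 5, complex_only=set()) == 2
```

For the unrestricted `pop r11` case, back-to-back loop iterations can pack across group boundaries, which is how 6 instructions per iteration average out to the measured 1.5 cycles.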
Why is the `POP` instruction decoded differently when using register `R12`? Are there any other instructions for which this is also the case?
Workaround: the `pop r/m64` encoding of `pop r12` doesn't have this decode penalty. (Thanks @Andreas for testing my guess.)

```nasm
db 0x41, 0x8f, 0xc4      ; REX.B=1  8F /0  pop r/m64  =  pop r12
```
The standard encoding of `pop r12` has the same opcode byte as `pop rsp`, differing only by a REX prefix. (The short-form encoding puts the register number in the low 3 bits of that single opcode byte.)

`pop rsp` is special-cased even in the decoders; on Haswell it's 3 uops¹, so clearly only the complex decoder can decode it. `pop r12` also getting penalized makes sense if the primary filtering of which decoder can decode which instruction goes by the opcode byte (not accounting for prefixes), at least for this group of opcodes. Whether or not this really reflects the exact internals, it's at least a useful mental model for understanding why the modrm form of `pop` doesn't have this effect. (Although normally you'd only use `pop r/m64` with a memory destination, which would mean multi-uop and thus complex-decoder only.) See *Can the simple decoders in recent Intel microarchitectures handle all 1-µop instructions?* for more about this effect with other opcodes.
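To make the overlap concrete, here is a small Python sketch that builds both encodings from the standard rules (short form: `0x58` plus the low 3 register bits, with REX.B supplying the 4th bit; modrm form: `8F /0`). The helper names are made up for illustration; the byte values follow the Intel opcode map.

```python
# Register numbers: rsp = 4, r12 = 12 (r12 is rsp's high-bank counterpart).
def pop_short(reg):
    # pop r64 short form: opcode byte 0x58 + (reg & 7); a REX.B prefix
    # (0x41) supplies the 4th register bit for r8..r15.
    rex = [0x41] if reg >= 8 else []
    return bytes(rex + [0x58 + (reg & 7)])

def pop_modrm(reg):
    # pop r/m64: opcode 0x8F with /0, register-direct modrm (mod=11, reg=0).
    rex = [0x41] if reg >= 8 else []
    return bytes(rex + [0x8F, 0xC0 | (reg & 7)])

RSP, R12 = 4, 12
assert pop_short(RSP) == bytes([0x5C])                # pop rsp
assert pop_short(R12) == bytes([0x41, 0x5C])          # pop r12
assert pop_short(RSP)[-1] == pop_short(R12)[-1]       # same opcode byte!
assert pop_modrm(R12) == bytes([0x41, 0x8F, 0xC4])    # the workaround encoding
```

A decoder that filters on the opcode byte alone can't tell `pop rsp` from `pop r12`; only the REX prefix distinguishes them.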
`push rsp` is 2 total uops on Haswell, unlike most `push reg` instructions, which are 1 uop. But that extra uop is likely just a stack-sync uop inserted during issue/rename (because the instruction reads RSP), not during decode. @Andreas reports that `push rsp` and `push r12` both show no special effects in the decoders (and, I assume, the uop cache): just 1 micro-fused uop, with or without a stack-sync uop when it executes.
Opcodes like `FF /0` (`inc r/m32`), where the same leading byte is shared between different instructions (overloading the modrm `reg` field as extra opcode bits), might be interesting to check, if there are some single-uop instructions that share a leading byte with multi-uop instructions. For example `C0 /4` (`shl r/m8, imm8`) vs. `C0 /2` (`rcl r/m8, imm8`); see http://ref.x86asm.net/coder64.html. But `shl` with a memory destination can already be multiple uops, so it might get optimistically attempted by the simple decoders anyway, and succeed if it turns out to be single-uop. Perhaps `pop r12`, by contrast, bails out early in the simple decoders without even looking at the REX prefix.
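As a concrete illustration of that `/digit` overloading, here's a purely illustrative Python sketch that selects the instruction in the `C0` shift/rotate group from the modrm byte's reg field (names per ref.x86asm.net):

```python
# For /digit opcode groups, bits 5:3 of the modrm byte act as extra opcode
# bits: one leading byte (here 0xC0) covers eight different instructions.
C0_GROUP = {0: "rol", 1: "ror", 2: "rcl", 3: "rcr",
            4: "shl", 5: "shr", 6: "shl", 7: "sar"}  # /6 is an undocumented alias of /4

def slash_digit(modrm):
    return (modrm >> 3) & 7  # the "reg" field of the modrm byte

# modrm 0xE0 = 11 100 000b: mod=11 (register-direct), reg=4, rm=0 (AL)
assert C0_GROUP[slash_digit(0xE0)] == "shl"  # C0 E0 ib = shl al, imm8
# modrm 0xD0 = 11 010 000b: reg=2
assert C0_GROUP[slash_digit(0xD0)] == "rcl"  # C0 D0 ib = rcl al, imm8
```

So a decoder that routes instructions by leading opcode byte alone would have to treat all eight of these the same way, which is what would make this group worth benchmarking.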
It would make sense for Intel to spend the transistors to make sure common instructions like immediate shifts decode efficiently, more so than for less-common instructions like `pop r12`, which you'll normally find only in function epilogues, and thus usually not in inner loops, only in larger loops that include function calls.
Footnote 1: `pop rsp` is special because it's effectively just `mov rsp, [rsp]`. (Or as the manual puts it, "The POP ESP instruction increments the stack pointer (ESP) before data at the old top of stack is written into the destination.") Haswell's 3-uop implementation seems unnecessary vs. literally the same 1 uop as `mov rsp, [rsp]` (I think the fault conditions are identical), but this might have saved transistors in the decoders by adding a uop to the normal way `pop reg` decodes (possibly implicitly requiring a stack-sync uop, for a total of 3), instead of treating it as a whole separate instruction. `pop rsp` is very rarely used, so its performance doesn't matter.
Perhaps the 16-bit `pop sp` case was a problem for decoding that opcode as 1 pure-load uop? There is no `[sp]` addressing mode in x86 machine code, and it's possible that limitation extends to the internal uops for 16-bit address generation. Other than that, I think the possible fault reasons are the same for `pop` and `mov`.
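The `pop rsp` ≡ `mov rsp, [rsp]` equivalence can be sketched as a toy simulation of the manual's ordering (hypothetical helper names; 64-bit operand size, so the increment is by 8):

```python
# Toy simulation of the manual's semantics: POP increments the stack pointer
# *before* the loaded data is written to the destination.  When the
# destination is rsp itself, the write overwrites the increment, leaving
# exactly the same final state as mov rsp, [rsp].
def pop_rsp(mem, rsp):
    val = mem[rsp]   # load from the old top of stack
    rsp += 8         # increment the stack pointer first...
    rsp = val        # ...then the write to the destination (rsp) wins
    return rsp

def mov_rsp_mem_rsp(mem, rsp):
    return mem[rsp]  # mov rsp, [rsp]

mem = {0x1000: 0x2000}  # a saved stack pointer sitting at the top of stack
assert pop_rsp(mem, 0x1000) == mov_rsp_mem_rsp(mem, 0x1000) == 0x2000
```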
`pop r12` (short form) does eventually decode to the normal 1 uop, with no more stack-sync uops than for repeated pops of other registers, as per @Andreas's testing. It gets penalized by not being decodeable by the simple decoders, but not by any of the extra uops that `pop rsp` specifically decodes to.
Perhaps GAS, NASM, and other assemblers should get a patch to make it possible to encode `pop r12` with the modrm encoding, although probably not as the default: decoder throughput is often not a problem, so spending an extra byte of code size by default would be undesirable, especially if there's no benefit on other uarches, like AMD or Silvermont-family.
And/or GCC should use R12 as its last choice of call-preserved register to save/restore? (R12 always needs a SIB byte when used as the base in an addressing mode, too, so that's another reason to avoid it, if compilers aren't going to try to avoid keeping pointers in it.) And maybe schedule the push/pop of R12 for efficient decoding, with 3 other pops (or other single-uop insns) after it before the multi-uop `ret`.