(One snippet is shown to keep the number of linked traces down; the actual throughput is irrelevant here, only uops per instruction matter.)
this sequence of instructions:
ret
ret 4
pop rcx
jmp rcx
generates the following trace on the Haswell uarch (that's the hardware I have, so I'll stick to it, but similar traces appear on other uarches as well): https://uica.uops.info/tmp/96c206fda86a42b6abcef52bd088ac13_trace.html
Why do `ret` and `ret imm16` take so many uops, especially compared to `pop` + `jmp`?

This post states that `pop` + `jmp` and `ret` are equivalent, which is indeed what I observe when benchmarking this code:
section .text
global _start
_start:
mov rax, 0xFFFFFFF
.lp:
call .tst
sub rax, 1
jge .lp
mov eax, 0x3C
xor edi, edi
syscall
.tst:
pop rcx
jmp rcx
; ret
where the version with `ret` takes about 1.3G cycles, and the version with `pop` + `jmp` takes about 1.3G cycles too.
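For scale (my arithmetic, not from the original benchmark output, assuming the loop dominates runtime): 1.3G cycles over 0xFFFFFFF iterations comes out to roughly 4.8 cycles per iteration for either version:

```python
# Back-of-the-envelope check of the benchmark numbers above.
# One iteration = call + (pop/jmp or ret) + sub/jge loop overhead.
iterations = 0xFFFFFFF          # loop trip count from the benchmark (~268M)
cycles = 1.3e9                  # measured total, same for both versions
per_iter = cycles / iterations
print(f"{per_iter:.2f} cycles per call/return round trip (incl. sub/jge)")
```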
The XML file with instruction info for `ret` without an immediate has this for HSW:
<architecture name="HSW">
[...]
ports="6*p06+6*p15+2*p23+2*p237+3*p4" uops="20" uops_MITE="4" uops_MS="6" uops_retire_slots="2"/>
which is not quite in the same order as uiCA shows it, but otherwise the same.

Agner Fog's instruction tables (page 250) say `ret` uses p237 and p6, which is about the same as `pop` + `jmp`.

Additionally, the ports reportedly used by `ret` are unexpected. In order:

1 * {p23} implies a memory read
6 * {p06 p15} suggests some arithmetic
3 * {p23(7) p4} implies a memory **write**
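To make the XML entry's breakdown easier to eyeball, here's a throwaway parser (my own, not part of uops.info's tooling) that tallies the uops per port group from the `ports` attribute:

```python
# Hypothetical helper: split a uops.info-style ports string like
# "6*p06+6*p15+..." into {port_group: uop_count}.
def parse_ports(s):
    groups = {}
    for term in s.split("+"):
        count, group = term.split("*")
        groups[group] = groups.get(group, 0) + int(count)
    return groups

groups = parse_ports("6*p06+6*p15+2*p23+2*p237+3*p4")
print(groups)                # {'p06': 6, 'p15': 6, 'p23': 2, 'p237': 2, 'p4': 3}
print(sum(groups.values()))  # 19 port uops, vs. uops="20" in the XML attribute
```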
I don't recall anything relevant from the uiCA paper, but I haven't reread it recently; I only skimmed it searching for keywords. I haven't checked the source code, because AFAIK all the instruction data is in the XML file mentioned above.

So judging by that, uiCA is clearly wrong, unless there is a reason I don't know about. Hence the question: is there such a reason?
Normally it's not useful to have `ret` in a block you're measuring with uiCA; it treats your code as a loop body whether or not it ends with a jcc back to the top, and whether or not it contains unconditional jumps. (If there's no jump, it's as if you were going to unroll the loop body, although for JCC-erratum and uop-cache purposes it just uses the addresses of the one copy, as if there were a jump that executed for free.)

uiCA uses data from https://uops.info/ measurements. Those measurements don't reflect normal usage of `ret` paired with `call`, only using `ret` alone in a loop as a general indirect jump. `ret` is hard to micro-benchmark on its own, especially in a benchmark framework that wants to test contiguous snippets of code inside loops, so there's nowhere to put a `call` to a separate `ret`. (Except maybe as part of the "setup" code which nanobench allows; it could contain a `ret` that gets jumped to during actual entry to the loop. But uops.info doesn't currently do that; @AndreasAbel might be interested in improving the `ret` microbenchmarks?)
I think Agner Fog's measurement on Haswell of `ret` decoding as 1 micro-fused uop for ports p237 p6 is correct; that matches my measurements on Skylake (footnote 1). (Except p7 is almost certainly not right, probably a typo or copy/paste error in his spreadsheet. Port 7 is a store-address execution unit only; it can't run loads.)
Footnote 1: counters like `uops_dispatched_port.port_6` don't distinguish between uops that go on to retire vs. those that get discarded. This isn't a problem for microbenchmarking most instructions, since the benchmark loop predicts perfectly.

The uops.info numbers for ret_near (https://uops.info/html-instr/RET_NEAR.html) on Haswell are 6*p06+6*p15+2*p23+2*p237+3*p4. Obviously way too high; we know `ret` isn't that slow when used normally (paired with a `call`).

uops.info's test loops (e.g. https://uops.info/html-tp/HSW/RET_NEAR-Measurements.html) use a RIP-relative LEA to get a "return address", which they store with `mov` instead of `push`:
# uops.info microbenchmark loop body for ret_near, unroll_count=500
0: 48 8d 05 05 00 00 00 lea rax,[rip+0x5] # 0xc
7: 48 89 04 24 mov QWORD PTR [rsp],rax
b: c3 ret
lea_target: # ret jumps here
Using `mov` instead of `push` means the RSP stack doesn't balance. And mixing explicit `[rsp]` addressing with a stack operation like `ret` that implicitly modifies RSP means the stack engine has to insert a stack-sync uop before every `mov`. This might be the MS (microcode sequencer) uop. This is a true extra cost for this loop, but it also gets amplified by mispredicts.
On Skylake, when the return-predictor stack is empty, `ret` falls back to normal indirect-branch prediction. But earlier CPUs didn't, I think. (This matters for Spectre mitigation; IIRC it makes retpolines not 100% reliable if pre-emption happens in the middle of a retpoline or something.) uops.info measured `ret` throughput at ~10 cycles on SKL, improved from ~29 cycles on Haswell and ~31 on Sandybridge, probably for that reason. Ice Lake is back up to ~31, and Alder Lake P-cores are down to 2.17 cycles per `ret` with the same loop. (Perhaps Ice Lake dropped the fallback to normal indirect-branch prediction but Alder Lake brought it back? Or something else is going on, like some branches aliasing each other in the predictors on Ice Lake.)
Anyway, `ret` trying to use a non-existent prediction from an underflowed predictor stack could lead to it mispredicting most of the time, with disastrous consequences. The per-port uop breakdowns for HSW and earlier on uops.info are nonsense, but somewhat plausible for SKL and later. They're probably still inflated by stack-sync uops and maybe imperfect branch prediction, since there still seems to be a store getting counted until Alder Lake.
**call/ret pairs**

My measurements on Skylake are that a `call rel32` / `ret` pair is 3 fused-domain uops, 5 unfused-domain uops that need execution units. uops.info and Agner Fog agree that `call rel32` alone is 2 front-end uops that run as 3 unfused-domain uops (jump, and store-address + store-data). That leaves `ret` as the expected 1 fused-domain uop: an indirect jump micro-fused with a load. I expect Agner Fog did something like this to measure `ret`.
I put one or two `call test` instructions in a tight loop (dec/jnz) to run under `perf stat` in a static executable, where `test: ret` was defined later. All the ALU uops were on port 6. Load/store uops were a mix of ports 2, 3, and 7. (Of course the port 7 uops are only from `call`.)
uops.info (https://uops.info/) measured SKL's ret_near alone at 2 front-end, 7 back-end uops. (Perhaps the port breakdown is including extra mis-speculated `mov` stores from their test loop.)
; testloop.asm - static executable for Linux
default rel
%use smartalign
alignmode p6, 64
global _start
_start:
mov ebp, 100000000
align 64
.loop:
call test
call test
dec ebp
jnz .loop
.end:
xor edi,edi
mov eax,231 ; __NR_exit_group from /usr/include/asm/unistd_64.h
syscall ; sys_exit_group(0)
align 32 ; in a separate 32-byte block to improve uop-cache hit rate, not that it matters much
test:
ret
## on my i7-6700k Skylake, Linux 6.5 / perf 6.5
$ nasm -felf64 testloop.asm
$ ld -o testloop testloop.o
$ taskset -c 1 perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,idq.mite_uops,branch-misses ./testloop
Performance counter stats for './testloop':
282.37 msec task-clock # 0.998 CPUs utilized
0 context-switches # 0.000 /sec
0 cpu-migrations # 0.000 /sec
1 page-faults # 3.541 /sec
1,100,001,492 cycles # 3.896 GHz
600,000,098 instructions # 0.55 insn per cycle
700,089,165 uops_issued.any # 2.479 G/sec
1,100,088,709 uops_executed.thread # 3.896 G/sec
8,413,976 idq.mite_uops # 29.798 M/sec
8 branch-misses
0.282836215 seconds time elapsed
0.282460000 seconds user
0.000000000 seconds sys
Commenting out one of the `call` instructions reduces the counts as expected: 3 front-end uops per iteration, 5 back-end. The dec/jnz is just 1 front-end / 1 back-end uop, for a total of 4 and 6 per iteration.
; with only one call in the loop
120.58 msec task-clock # 0.996 CPUs utilized
463,605,997 cycles # 3.845 GHz
400,000,049 instructions # 0.86 insn per cycle
400,041,809 uops_issued.any # 3.318 G/sec
600,041,674 uops_executed.thread # 4.976 G/sec
134,054,953 idq.mite_uops # 1.112 G/sec
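Both runs can be cross-checked against the per-instruction uop counts claimed above (call rel32: 2 fused / 3 unfused, ret: 1 fused / 2 unfused, macro-fused dec/jnz: 1 / 1); this is just my arithmetic restated as a script:

```python
# Expected uops_issued.any (fused domain) and uops_executed.thread
# (unfused domain) for the test loop, per number of call/ret pairs.
ITERS = 100_000_000
call_fused, call_unfused = 2, 3
ret_fused,  ret_unfused  = 1, 2
loop_fused, loop_unfused = 1, 1   # dec/jnz macro-fuses into one uop

for pairs in (2, 1):
    fused   = pairs * (call_fused + ret_fused) + loop_fused
    unfused = pairs * (call_unfused + ret_unfused) + loop_unfused
    print(f"{pairs} pair(s): {fused * ITERS:,} issued, {unfused * ITERS:,} executed")
# 2 pairs: 700,000,000 issued, 1,100,000,000 executed (measured ~700.09M / ~1100.09M)
# 1 pair:  400,000,000 issued,   600,000,000 executed (measured ~400.04M / ~600.04M)
```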
With counters for `uops_dispatched_port.port_0`, `port_1`, `port_5`, `port_6` and `uops_dispatched_port.port_2`, `port_3`, `port_7` in two separate runs, we can see the distribution:
# 2 call/ret pairs + dec/jnz
5,619 uops_dispatched_port.port_0 # 19.750 K/sec
7,637 uops_dispatched_port.port_1 # 26.843 K/sec
8,827 uops_dispatched_port.port_5 # 31.025 K/sec
500,099,429 uops_dispatched_port.port_6 # 1.758 G/sec
164,323,545 uops_dispatched_port.port_2 # 581.175 M/sec
135,457,372 uops_dispatched_port.port_3 # 479.082 M/sec
100,224,734 uops_dispatched_port.port_7 # 354.472 M/sec
I don't know where those thousands of uops for ports 0, 1, and 5 are coming from. None of the instructions this program runs in user-space can run on them. Unless there's some kind of stack-sync thing when interrupts happen.
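The big counters, at least, line up almost exactly with expectations; a quick accounting sketch (my arithmetic, assuming the port assignments described above: all five branch uops per iteration on port 6, call's store-address on ports 2/3/7, ret's load on ports 2/3):

```python
# Per iteration: 2 call + 2 ret + 1 macro-fused dec/jnz branch uops on p6,
# plus 2 store-address uops (call) and 2 load uops (ret) on the AGU ports.
ITERS = 100_000_000
expected_p6  = (2 + 2 + 1) * ITERS   # branch uops
expected_agu = (2 + 2) * ITERS       # store-address + load uops

measured_p6  = 500_099_429
measured_agu = 164_323_545 + 135_457_372 + 100_224_734   # ports 2 + 3 + 7

print(measured_p6 - expected_p6)     # ~100K extra p6 uops over 100M iterations
print(measured_agu - expected_agu)   # ~6K extra AGU uops
```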