Tags: assembly, x86, x86-64, cpu-architecture

x86 uica reports high uop count for ret instruction, does not agree with other sources


(One combined snippet is used to reduce the number of trace links; the actual throughput is irrelevant here, only uops per instruction matters.)

This sequence of instructions:

    ret

    ret 4

    pop rcx
    jmp rcx

generates the following trace on the Haswell uarch (that's what I have, so I'll stick with it, but similar traces occur on other uarches as well): https://uica.uops.info/tmp/96c206fda86a42b6abcef52bd088ac13_trace.html

Why do ret and ret imm16 take so many uops, especially compared to pop + jmp?

This post states that pop + jmp and ret are equivalent, which is indeed what is observed when benchmarking this code:

section .text

global _start

_start:
    mov   rax, 0xFFFFFFF

.lp:
    call  .tst

    sub   rax, 1
    jge   .lp

    mov   eax, 0x3C
    xor   edi, edi
    syscall

.tst:
    pop   rcx
    jmp   rcx
;   ret

The version with ret takes about 1.3G cycles, and the version with pop + jmp also takes about 1.3G cycles.
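As a rough sanity check, the cycles-per-iteration implied by that benchmark can be computed directly (cycle and iteration counts taken from the code and numbers above; this is back-of-the-envelope only):

```python
# Rough cycles-per-iteration estimate for the benchmark above.
iterations = 0xFFFFFFF + 1   # rax counts 0xFFFFFFF down through 0 (jge)
cycles = 1.3e9               # ~1.3G cycles measured for either version

per_iter = cycles / iterations
print(round(per_iter, 2))    # ~4.84 cycles per call + return + sub/jge iteration
```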

The XML file with instruction info for ret without an immediate has this entry for HSW:

<architecture name="HSW">
[...]
    ports="6*p06+6*p15+2*p23+2*p237+3*p4" uops="20" uops_MITE="4" uops_MS="6" uops_retire_slots="2"/>

which is not quite in the same order as uiCA shows, but is otherwise the same.
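For reference, the ports attribute from that XML entry can be split into per-port-group uop counts mechanically (just string parsing; the string is copied from the entry above):

```python
# Split a uops.info-style "ports" attribute into per-port-group uop counts.
ports = "6*p06+6*p15+2*p23+2*p237+3*p4"  # from the HSW XML entry above

counts = {}
for term in ports.split("+"):
    n, group = term.split("*")
    counts[group] = counts.get(group, 0) + int(n)

print(counts)                # {'p06': 6, 'p15': 6, 'p23': 2, 'p237': 2, 'p4': 3}
print(sum(counts.values()))  # 19 uops across the listed port groups
```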

Agner Fog's instruction tables (page 250) say ret uses p237 and p6, which is about the same as pop + jmp.

Additionally, the ports reportedly used by ret are unexpected. In order:

1 * {p23} implies memory read  
6 * {p06 p15} suggests some arithmetic stuff  
3 * {p23(7) p4} implies memory **write**  

I don't recall anything about this from the uiCA paper, but I haven't reread it recently, only skimmed it searching for keywords. I haven't checked the source code because, AFAIK, all the instruction data is in the XML file mentioned above.

So judging by that, uiCA is clearly wrong, unless there is a reason I don't know about. Hence the question: is there such a reason?


Solution

  • Normally it's not useful to have ret in a block you're measuring with uiCA; it treats your code as a loop body whether or not it ends with a jcc back to the top, and whether or not it contains unconditional jumps. (If there's no jump, it's as if you were going to unroll the loop body, although for JCC-erratum and uop-cache purposes it just uses the addresses of the one copy, as if there was a jump that executed for free.)


    uiCA uses data from https://uops.info/ measurements. Those measurements don't reflect normal usage of ret paired with call, only using ret alone in a loop as a general indirect jump. ret is hard to micro-benchmark on its own, especially in a benchmark framework that wants to test contiguous snippets of code inside loops, so there's nowhere to put a call to a separate ret. (Except maybe as part of the "setup" code which nanobench allows; it could contain a ret that gets jumped over during actual entry to the loop. But uops.info doesn't currently do that; @AndreasAbel might be interested in improving the ret microbenchmarks?)

    I think Agner Fog's measurement on Haswell of ret decoding as 1 micro-fused uop for ports p237 p6 is correct; that matches my measurements on Skylake¹. (Except p7 is almost certainly not right, probably a typo or copy/paste error in his spreadsheet. Port 7 is a store-address execution unit only; it can't run loads.)


    The uops.info numbers for ret_near (https://uops.info/html-instr/RET_NEAR.html) on Haswell are

    • 28.91 cycle actual measured throughput, which is very slow, probably from branch mispredicts. That will throw off any uop counts other than retired, since uops on the wrong path still get issued and executed in the shadow of the mispredict. The counters for events like uops_dispatched_port.port_6 don't distinguish between uops that go on to retire vs. those that get discarded. This isn't a problem for microbenchmarking most instructions since the benchmark loop predicts perfectly.
    • 2 front-end uops / 20 back-end uops, for ports 6*p06+6*p15+2*p23+2*p237+3*p4. Obviously way too high; we know ret isn't that slow when used normally (paired with a call).

    uops.info's test loops (e.g. https://uops.info/html-tp/HSW/RET_NEAR-Measurements.html) use a RIP-relative LEA to get a "return address", which they store with mov instead of push.

    # uops.info microbenchmark loop body for ret_near, unroll_count=500
       0:   48 8d 05 05 00 00 00    lea    rax,[rip+0x5]        # 0xc
       7:   48 89 04 24             mov    QWORD PTR [rsp],rax
       b:   c3                      ret
       c:                           lea_target:   # ret jumps here
    

    Using mov instead of push means the RSP stack doesn't balance. And mixing explicit [rsp] with a stack operation like ret that implicitly modifies RSP means the stack engine will have to insert a stack-sync uop before every mov. This might be the MS (microcode sequencer) uop. This is a true extra cost for this loop, but also gets amplified by mispredicts.


    On Skylake, when the return-predictor stack is empty, ret falls back to normal indirect-branch prediction. But earlier CPUs didn't, I think. (This matters for Spectre mitigation, IIRC making retpolines not 100% reliable if pre-emption happens in the middle of a retpoline or something.) uops.info measured ret throughput at ~10 cycles on SKL, improved from ~29 cycles on Haswell and ~31 on Sandybridge, probably for that reason. Ice Lake is back up to ~31, Alder Lake P-cores are down to 2.17 cycles per ret with the same loop. (Perhaps Ice Lake dropped the fallback to normal indirect branch prediction but Alder Lake brought it back? Or something else is going on, like maybe some branches aliased each other in the predictors in Ice Lake.)

    Anyway, ret trying to use a non-existent prediction from an underflowed predictor-stack could lead to it mispredicting most of the time, with disastrous consequences. The per-port uop breakdowns for HSW and earlier are nonsense on uops.info, but somewhat plausible for SKL and later. Still inflated probably from stack-sync uops and maybe imperfect branch prediction, since there still seems to be a store getting counted until Alder Lake.


    Footnote 1: Skylake measurements of call/ret pairs

    My measurements on Skylake are that a call rel32/ret pair is 3 fused-domain uops, and 5 unfused-domain uops that need execution units. uops.info and Agner Fog agree that call rel32 alone is 2 front-end uops that run as 3 unfused-domain uops (jump, plus micro-fused store-address + store-data). That leaves ret as the expected 1 fused-domain uop: an indirect jump micro-fused with a load. I expect Agner Fog did something like this to measure ret.
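That accounting can be cross-checked with trivial arithmetic (the per-instruction counts are the ones stated above):

```python
# Uop accounting for one call rel32 + ret pair, per the counts stated above.
call_fused, call_unfused = 2, 3   # jump + store-address + store-data
ret_fused,  ret_unfused  = 1, 2   # indirect jump micro-fused with a load

print(call_fused + ret_fused,      # 3 fused-domain uops per pair
      call_unfused + ret_unfused)  # 5 uops needing execution units
```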

    I put one or two call test instructions in a tight loop (dec/jnz) to run under perf stat in a static executable, with test: ret defined later. All the ALU uops went to port 6. Load/store uops were a mix of ports 2, 3, and 7. (Of course the port-7 uops are only from call.)

    uops.info (https://uops.info/) measured SKL's ret_near alone at 2 front-end / 7 back-end uops. (Perhaps the port breakdown includes extra mis-speculated mov stores from their test loop.)

    ; testloop.asm   - static executable for Linux
    default rel
    %use smartalign
    alignmode p6, 64
    
    global _start
    _start:
    
        mov     ebp, 100000000
    align 64
    .loop:
        call test
        call test
        dec ebp
        jnz .loop
    .end:
    
        xor edi,edi
        mov eax,231   ; __NR_exit_group  from /usr/include/asm/unistd_64.h
        syscall       ; sys_exit_group(0)
    
    align 32        ; in a separate 32-byte block to improve uop-cache hit rate, not that it matters much
    test:
      ret
    
        ## on my i7-6700k Skylake, Linux 6.5 / perf 6.5
    $ nasm -felf64 testloop.asm
    $ ld -o testloop testloop.o
    $ taskset -c 1 perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,idq.mite_uops,branch-misses  ./testloop
    
     Performance counter stats for './testloop':
    
                282.37 msec task-clock                       #    0.998 CPUs utilized             
                     0      context-switches                 #    0.000 /sec                      
                     0      cpu-migrations                   #    0.000 /sec                      
                     1      page-faults                      #    3.541 /sec                      
         1,100,001,492      cycles                           #    3.896 GHz                       
           600,000,098      instructions                     #    0.55  insn per cycle            
           700,089,165      uops_issued.any                  #    2.479 G/sec                     
         1,100,088,709      uops_executed.thread             #    3.896 G/sec                     
             8,413,976      idq.mite_uops                    #   29.798 M/sec                     
                     8      branch-misses                                                         
    
           0.282836215 seconds time elapsed
    
           0.282460000 seconds user
           0.000000000 seconds sys
    

    Commenting out one of the call instructions reduces the counts as expected, 3 front-end uops per iter, 5 back-end. The dec/jnz is just 1 front-end / 1 back-end uop, for a total of 4 and 6 per iteration.

    ; with only one call in the loop
                120.58 msec task-clock                       #    0.996 CPUs utilized             
           463,605,997      cycles                           #    3.845 GHz                       
           400,000,049      instructions                     #    0.86  insn per cycle            
           400,041,809      uops_issued.any                  #    3.318 G/sec                     
           600,041,674      uops_executed.thread             #    4.976 G/sec                     
           134,054,953      idq.mite_uops                    #    1.112 G/sec                     
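Dividing the raw counters from both runs by the iteration count recovers the stated per-iteration breakdown (a quick sketch; counter values copied from the perf stat output above, rounded to the nearest integer):

```python
# Per-iteration uop counts derived from the two perf stat runs above.
iters = 100_000_000  # mov ebp, 100000000

# Two call/ret pairs + dec/jnz per iteration:
assert round(700_089_165 / iters) == 7     # fused domain: 2*(2+1) + 1
assert round(1_100_088_709 / iters) == 11  # unfused domain: 2*(3+2) + 1

# One call/ret pair + dec/jnz per iteration:
assert round(400_041_809 / iters) == 4     # fused domain: (2+1) + 1
assert round(600_041_674 / iters) == 6     # unfused domain: (3+2) + 1
print("per-iteration counts match the stated uop breakdown")
```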
    

    With counters for uops_dispatched_port.port_0,uops_dispatched_port.port_1,uops_dispatched_port.port_5,uops_dispatched_port.port_6 and uops_dispatched_port.port_2,uops_dispatched_port.port_3,uops_dispatched_port.port_7 in two separate runs, we can see the distribution:

    # 2 call/ret pairs + dec/jnz
    
                 5,619      uops_dispatched_port.port_0      #   19.750 K/sec                     
                 7,637      uops_dispatched_port.port_1      #   26.843 K/sec                     
                 8,827      uops_dispatched_port.port_5      #   31.025 K/sec                     
           500,099,429      uops_dispatched_port.port_6      #    1.758 G/sec                     
    
           164,323,545      uops_dispatched_port.port_2      #  581.175 M/sec                     
           135,457,372      uops_dispatched_port.port_3      #  479.082 M/sec                     
           100,224,734      uops_dispatched_port.port_7      #  354.472 M/sec                     
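Those port counters also line up with the expected per-iteration uops (the accounting in the comments is my reading of the numbers above; note the per-pair store-data uop goes to port 4, which wasn't in either counter group):

```python
# Execution-port counts per iteration from the 2-call run above.
iters = 100_000_000

port6 = round(500_099_429 / iters)
assert port6 == 5  # 2 call jumps + 2 ret jumps + 1 macro-fused dec/jnz, all p6

p237_total = 164_323_545 + 135_457_372 + 100_224_734
assert round(p237_total / iters) == 4  # 2 ret loads + 2 call store-address uops
# (the 2 store-data uops per iteration go to port 4, not counted in these runs)
print(port6, round(p237_total / iters))
```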
    

    I don't know where those thousands of uops for ports 0, 1, and 5 are coming from. None of the instructions this program runs in user-space can run on them. Unless there's some kind of stack-sync thing when interrupts happen.