Tags: performance, assembly, x86, micro-optimization, microbenchmark

Make a register depend on another one without changing its value


Consider the following x86 assembly:

; something that sets rax
mov rcx, [rdi]
xor rax, rcx
xor rax, rcx

At the end of the sequence, rax has the same value as it had on entry, but from the point of view of the CPU its value depends on the value loaded from memory into rcx. In particular, subsequent use of rax won't start until that load and the two xor instructions complete.

Is there any way to achieve this effect more efficiently than the two-xor sequence, e.g., with a single one-uop one-cycle-latency instruction? It's OK if some constant value needs to be set up once before the sequence (e.g., having a zeroed register).


Solution

  • With only 1 uop / 1c latency on the critical path of the target register:

    # target=rax  extra source=rcx
    mov  edx, ecx    ; no latency
    and  edx, 0      ; BMI1 ANDN could do the mov+and in 1 uop (port 1 or 5 only on SnB-family; any port on Ryzen)
    
    or   rax, rdx
    

    AND with zero is not special-cased as a dep-breaking zeroing idiom on any CPUs, AFAIK.

    front-end uops: 3 (or 2 with BMI1). Latency:

    • from rcx to rax: 2c (with mov-elimination or BMI1).
    • from rax(input) to rax(output): 1c

    With a zeroed register, if it's ok to couple all the dep chains into that one register (unlike the ANDN version that only reads an all-ones register):

    and   edx, ecx         # 0 &= ecx
    or    rax, rdx         # rax |= 0
    
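As a sanity check, the zeroed-register AND/OR sequence can be wrapped in GNU C inline asm (a sketch; the function name `depend_on` is mine) to confirm it leaves the target unchanged while architecturally reading the source:

```c
#include <stdint.h>

/* Sketch: inject a dependency of dst on src without changing dst.
   'zero' plays the role of the pre-zeroed EDX register. */
static uint64_t depend_on(uint64_t dst, uint64_t src)
{
    uint64_t zero = 0;
    __asm__("and %k[src], %k[z]\n\t"   /* 0 &= src: stays 0, but now depends on src */
            "or  %[z], %[dst]"         /* dst |= 0: value unchanged */
            : [dst] "+r"(dst), [z] "+r"(zero)
            : [src] "r"(src)
            : "cc");
    return dst;
}
```

Note that the AND leaves the scratch register zero, so the same register can be reused every iteration without re-zeroing.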

    To test a function's latency (not throughput) but still feed it the same input repeatedly:

    .loop:
        call  func        ; arg in RDI, return in RAX
        mov   rdi, rbx    ; arg for next iter, off the critical path
    
        and   eax, 0      ; 1c latency
        or    rdi, rax    ; 1c latency
    
    jmp   .loop
    

If the function is pure, we can do it in 1 uop / 1c latency.

Actually it just has to return a known value for a given input: this also works if its impurities are limited to having other side-effects / outputs.

    Instead of XOR twice after getting the result, set things up so we already have an XOR that we can unscramble with only one more XOR. Or use addition because LEA lets us copy-and-add in one instruction, saving a mov that wouldn't be on the critical path.

        mov   rdi, rbx        ; original input
        call  func
        sub   rbx, rax        ; RBX = input - output
    
    .loop:
        call  func
        lea   rdi, [rbx + rax]   ; RDI = (input-output) + output = input
        jmp  .loop
    
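The algebra behind the LEA trick can be checked in plain C (`func` here is a hypothetical stand-in for any pure function; all arithmetic wraps mod 2^64 just like the registers do):

```c
#include <stdint.h>

/* Hypothetical stand-in for a pure function under test. */
static uint64_t func(uint64_t x) { return x * 2654435761ULL + 11; }

/* One-time setup: the 'sub rbx, rax' step, RBX = input - output. */
static uint64_t make_delta(uint64_t input) { return input - func(input); }

/* Per-iteration step: lea rdi, [rbx + rax] regenerates the input
   from the stored delta and the just-returned output. */
static uint64_t next_input(uint64_t delta, uint64_t output)
{
    return delta + output;   /* (input - output) + output == input (mod 2^64) */
}
```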

@RossRidge's suggestion is a single uop on SnB-family CPUs, but it only runs on port 1:

    shld rax, rcx, 0
    

    3c latency, 1 uop for port 1 on HSW/SKL. Agner Fog reports 1c latency for IvB, but 3c for HSW/BDW/SKL.

    shld r,r,i is 2 uops on older Intel, and significantly slower on AMD, like 6 uops / 3c latency on Piledriver / Ryzen.

    Note that instlatx64 reports 1c latency / 0.5c throughput for shld/shrd on Haswell/Skylake (like single-register shifts), but I tested myself and it's definitely 3c latency / 1c throughput. Reported as an instlatx64 bug on their github page.

    SHLD could also be interesting for copying a 32-bit register with a dependency on another. e.g. @BeeOnRope describes wanting to call a function repeatedly with the same input value in RDI, but with a dependency on the result in RAX. If we only care about EDI, then

    ; RBX = input<<32
    call  func
    mov   edi, eax         ; 0 latency with mov-elimination
    shld  rdi, rbx, 32     ; EDI = the high 32 bits of RBX, high bits of RDI = old EDI.
    

Of course that's pointless vs. this version, which doesn't need mov-elimination:

    call   func
    mov    rdi, rbx        ; off critical path
    shld   rdi, rax, 0     ; possibly 1c latency on SnB / IvB.  3 on HSW/SKL
    

    A modification of @DavidWholford's suggestion works, too:

    test ecx,ecx     ; CF=0, with a false dependency on RCX
    adc  rax, 0      ; dependent on CF
    

    2 uops on Haswell/Broadwell/Skylake and AMD. 3 uops on Intel P6-family, and maybe SnB/IvB. Latency:

    • from rcx to rax: 2c on HSW and later, 3 with 2-uop adc
    • from rax to rax: 1c on HSW and later, 2 with 2-uop adc

    ADC on Haswell and earlier is normally 2 uops, but adc with immediate 0 is special-cased on Haswell to only be 1 uop / 1c. adc eax,0 is always 2c latency on Core 2. The first uarch with this optimization might be SnB, but hopefully we get an answer on Which Intel microarchitecture introduced the ADC reg,0 single-uop special case?
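The test/adc-0 sequence is easy to verify for value-preservation (a sketch in GNU C inline asm; the wrapper name is mine). TEST always clears CF, so the ADC adds exactly zero:

```c
#include <stdint.h>

/* Sketch: test/adc-0 dependency injection. CF is cleared by TEST,
   so ADC adds 0 + CF = 0 and dst comes back unchanged. */
static uint64_t adc_dep(uint64_t dst, uint64_t src)
{
    __asm__("test %[src], %[src]\n\t"  /* CF = 0, reads the source register */
            "adc  $0, %[dst]"          /* dst += 0 + CF: value unchanged */
            : [dst] "+r"(dst)
            : [src] "r"(src)
            : "cc");
    return dst;
}
```

This only checks the architectural result, of course; the uop counts and latencies above have to be measured with performance counters.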

    test clears CF regardless of the value, but I think (untested) that CF still carries a dependency on the source register. If not, then perhaps TEST / ADOX could be useful on Broadwell and later. (CF is renamed separately on most CPUs, but OF might only be part of the same flag bundle as ZF / SF and the other flags that do depend on the AND result.)
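For what it's worth, the TEST / ADOX variant can at least be checked for correctness (a sketch in GNU C inline asm; requires ADX, i.e. Broadwell+ or Zen, and the wrapper name is mine). TEST clears OF, so ADOX adds zero + OF and leaves the target unchanged; whether the OF input actually avoids the CF dependency is the open question above, which this snippet can't measure:

```c
#include <stdint.h>

/* Sketch: TEST / ADOX dependency injection (needs ADX support).
   TEST clears OF; ADOX computes dst += zero + OF = dst. */
static uint64_t adox_dep(uint64_t dst, uint64_t src)
{
    uint64_t zero = 0;
    __asm__("test %[src], %[src]\n\t"   /* OF = 0, reads the source register */
            "adox %[z], %[dst]"         /* dst += 0 + OF: value unchanged */
            : [dst] "+r"(dst)
            : [src] "r"(src), [z] "r"(zero)
            : "cc");
    return dst;
}
```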