Tags: performance, assembly, x86, micro-optimization, microbenchmark

Make a register depend on another one without changing its value


Consider the following x86 assembly:

; something that sets rax
mov rcx, [rdi]
xor rax, rcx
xor rax, rcx

At the end of the sequence, rax has the same value as it had on entry, but from the point of view of the CPU its value depends on the value loaded from memory into rcx. In particular, subsequent use of rax won't start until that load and the two xor instructions complete.

Is there any way to achieve this effect more efficiently than the two-xor sequence, e.g., with a single one-uop one-cycle-latency instruction? It's OK if some constant value needs to be set up once before the sequence (e.g., having a zeroed register).


Solution

  • With only 1 uop / 1c latency on the critical path of the target register:

    # target=rax  extra source=rcx
    mov  edx, ecx    ; no latency
    and  edx, 0      ; BMI1 ANDN could do the mov+and in 1 uop (port 1 or 5 only on SnB-family; any port on Ryzen)
    
    or   rax, rdx
    

    AND with zero is not special-cased as a dep-breaking zeroing idiom on any CPUs, AFAIK.

    front-end uops: 3 (or 2 with BMI1). Latency:

    • from rcx to rax: 2c (with mov-elimination or BMI1).
    • from rax(input) to rax(output): 1c

    With a zeroed register, if it's ok to couple all the dep chains into that one register (unlike the ANDN version that only reads an all-ones register):

    and   edx, ecx         # 0 &= ecx
    or    rax, rdx         # rax |= 0
    
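As a sanity check, the zeroed-register AND/OR sequence can be wrapped in GNU C inline asm (a sketch; the function name `depend_on` is mine) to confirm it leaves the target unchanged while architecturally reading the source:

```c
#include <stdint.h>

/* Sketch: inject a dependency of dst on src without changing dst.
   'zero' plays the role of the pre-zeroed EDX register. */
static uint64_t depend_on(uint64_t dst, uint64_t src)
{
    uint64_t zero = 0;
    __asm__("and %k[src], %k[z]\n\t"   /* 0 &= src: stays 0, but now depends on src */
            "or  %[z], %[dst]"         /* dst |= 0: value unchanged */
            : [dst] "+r"(dst), [z] "+r"(zero)
            : [src] "r"(src)
            : "cc");
    return dst;
}
```

Note that the AND leaves the scratch register zero, so the same register can be reused every iteration without re-zeroing.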

    To test a function's latency (not throughput) but still feed it the same input repeatedly:

    .loop:
        call  func        ; arg in RDI, return in RAX
        mov   rdi, rbx    ; arg for next iter, off the critical path
    
        and   eax, 0      ; 1c latency
        or    rdi, rax    ; 1c latency
    
    jmp   .loop
    

If the function is pure, we can do it in 1 uop / 1c latency.

Actually it just has to return a known value for a given input: this also works if its impurities are limited to having other side-effects / outputs.

    Instead of XOR twice after getting the result, set things up so we already have an XOR that we can unscramble with only one more XOR. Or use addition because LEA lets us copy-and-add in one instruction, saving a mov that wouldn't be on the critical path.

        mov   rdi, rbx        ; original input
        call  func
        sub   rbx, rax        ; RBX = input - output
    
    .loop:
        call  func
        lea   rdi, [rbx + rax]   ; RDI = (input-output) + output = input
        jmp  .loop
    
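The algebra behind the LEA trick can be checked in plain C (`func` here is a hypothetical stand-in for any pure function; all arithmetic wraps mod 2^64 just like the registers do):

```c
#include <stdint.h>

/* Hypothetical stand-in for a pure function under test. */
static uint64_t func(uint64_t x) { return x * 2654435761ULL + 11; }

/* One-time setup: the 'sub rbx, rax' step, RBX = input - output. */
static uint64_t make_delta(uint64_t input) { return input - func(input); }

/* Per-iteration step: lea rdi, [rbx + rax] regenerates the input
   from the stored delta and the just-returned output. */
static uint64_t next_input(uint64_t delta, uint64_t output)
{
    return delta + output;   /* (input - output) + output == input (mod 2^64) */
}
```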

@RossRidge's suggestion is a single uop on SnB-family CPUs, but it only runs on port 1:

    shld rax, rcx, 0
    

    3c latency, 1 uop for port 1 on HSW/SKL. Agner Fog reports 1c latency for IvB, but 3c for HSW/BDW/SKL.

    shld r,r,i is 2 uops on older Intel, and significantly slower on AMD, like 6 uops / 3c latency on Piledriver / Ryzen.

    Note that instlatx64 reports 1c latency / 0.5c throughput for shld/shrd on Haswell/Skylake (like single-register shifts), but I tested myself and it's definitely 3c latency / 1c throughput. Reported as an instlatx64 bug on their github page.

    SHLD could also be interesting for copying a 32-bit register with a dependency on another. e.g. @BeeOnRope describes wanting to call a function repeatedly with the same input value in RDI, but with a dependency on the result in RAX. If we only care about EDI, then

    ; RBX = input<<32
    call  func
    mov   edi, eax         ; 0 latency with mov-elimination
    shld  rdi, rbx, 32     ; EDI = the high 32 bits of RBX, high bits of RDI = old EDI.
    

Of course that's pointless vs. this version, which doesn't need mov-elimination:

    call   func
    mov    rdi, rbx        ; off critical path
    shld   rdi, rax, 0     ; possibly 1c latency on SnB / IvB.  3 on HSW/SKL
    

    A modification of @DavidWholford's suggestion works, too:

    test ecx,ecx     ; CF=0, with a false dependency on RCX
    adc  rax, 0      ; dependent on CF
    

    2 uops on Haswell/Broadwell/Skylake and AMD. 3 uops on Intel P6-family, and maybe SnB/IvB. Latency:

    • from rcx to rax: 2c on HSW and later, 3 with 2-uop adc
    • from rax to rax: 1c on HSW and later, 2 with 2-uop adc

    ADC on Haswell and earlier is normally 2 uops, but adc with immediate 0 is special-cased on Haswell to only be 1 uop / 1c. adc eax,0 is always 2c latency on Core 2. The first uarch with this optimization might be SnB, but hopefully we get an answer on Which Intel microarchitecture introduced the ADC reg,0 single-uop special case?
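The test/adc-0 sequence is easy to verify for value-preservation (a sketch in GNU C inline asm; the wrapper name is mine). TEST always clears CF, so the ADC adds exactly zero:

```c
#include <stdint.h>

/* Sketch: test/adc-0 dependency injection. CF is cleared by TEST,
   so ADC adds 0 + CF = 0 and dst comes back unchanged. */
static uint64_t adc_dep(uint64_t dst, uint64_t src)
{
    __asm__("test %[src], %[src]\n\t"  /* CF = 0, reads the source register */
            "adc  $0, %[dst]"          /* dst += 0 + CF: value unchanged */
            : [dst] "+r"(dst)
            : [src] "r"(src)
            : "cc");
    return dst;
}
```

This only checks the architectural result, of course; the uop counts and latencies above have to be measured with performance counters.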

    test clears CF regardless of the value, but I think (untested) that CF still carries a dependency on the source register. If not, then perhaps TEST / ADOX could be useful on Broadwell and later. (CF is renamed separately on most CPUs, but OF might only be part of the same flag bundle as ZF / SF and the other flags that do depend on the AND result.)
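For what it's worth, the TEST / ADOX variant can at least be checked for correctness (a sketch in GNU C inline asm; requires ADX, i.e. Broadwell+ or Zen, and the wrapper name is mine). TEST clears OF, so ADOX adds zero + OF and leaves the target unchanged; whether the OF input actually avoids the CF dependency is the open question above, which this snippet can't measure:

```c
#include <stdint.h>

/* Sketch: TEST / ADOX dependency injection (needs ADX support).
   TEST clears OF; ADOX computes dst += zero + OF = dst. */
static uint64_t adox_dep(uint64_t dst, uint64_t src)
{
    uint64_t zero = 0;
    __asm__("test %[src], %[src]\n\t"   /* OF = 0, reads the source register */
            "adox %[z], %[dst]"         /* dst += 0 + OF: value unchanged */
            : [dst] "+r"(dst)
            : [src] "r"(src), [z] "r"(zero)
            : "cc");
    return dst;
}
```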