assembly · x86 · bit-manipulation · micro-optimization

Copy bit of one register to another register (x86-64 asm)


As part of a project that generates x86-64 machine code at runtime, I very often need to copy a bit from one register to another register at a different bit position.

I came up with this code (example that copies bit 23 of a source register to bit 3 of a destination):

bt eax, 23           ; test bit 23 of eax (source)
setc ebx             ; copy result to ebx (ebx is guaranteed to be zero before)
shl ebx, 3           ; shift up to become our target bit 3
and ecx, 0xFFFFFFF7  ; remove bit 3 from ecx (target)
or ecx, ebx          ; set new bit value

Given that I need five instructions just to copy one bit of one register to another bit of another register: is there something on x86 that uses fewer instructions?

I've read a bit about BMI instructions, but unfortunately they do not offer bit extraction using immediates.


Solution

  • Number of instructions is not a good performance metric. Either code-size in bytes (x86 machine instructions are variable length), or for how existing mainstream CPUs execute them: number of front-end uops (after decoding) and/or latency / throughput are relevant optimization goals, with the important one depending on the surrounding code for something this small. (What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?)

    rcr / rcl are unfortunately very slow, especially with count != 1.
    Four single-uop instructions are much faster, including a BMI2 rorx to copy-and-rotate the bit you want to insert to the right position in a tmp register, then and to isolate it. Or a normal shift/and if you don't need to preserve the input. That's more efficient than anything I've come up with that bounces the bit through CF.

    This is more efficient than xor-zero/bt/setc/shl. It also avoids partial-register false dependencies or stalls: setc ebx doesn't exist, only setc bl (or setc bh). It also means that if you can destroy your input register instead of using a temporary, you don't need anything inefficient like setc bl / movzx ebx, bl which would put the zero-extension on the critical path for latency, and defeat mov-elimination.
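
    For comparison, here's the CF-bounce version written out correctly (xor-zeroing included, and setc targeting a byte register):

       xor   edx, edx           ; zero the temp (the question's code assumed this had already happened)
       bt    eax, 23            ; CF = bit 23 of EAX
       setc  dl                 ; setcc can only write an 8-bit register
       shl   edx, 3             ; move the bit up to position 3
       and   ecx, 0xFFFFFFF7    ; clear bit 3 in the target
       or    ecx, edx           ; merge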

    I switched to EDX as a temporary because it's call-clobbered in normal calling conventions.

    ; input in EAX, merge-target in ECX (read/write)
    ; no pre-conditions necessary
    ;  unlike the original, which doesn't count the cost of zeroing EBX
    
       rorx  edx, eax, 23-3       ; shift bit 23 to bit 3.  Use SHR if you don't need the EAX value later
       and   edx, 1<<3            ; and isolate
       and   ecx, 0xFFFFFFF7      ; clear it in the destination
       or    ecx, edx             ; and merge
    
    ; total size: 14 bytes of machine code for imm8 masks, 20 for imm32 masks
    ; 4 uops.
    

    The or could instead be an add or lea if you want because we know there are no overlapping 1 bits. lea being useful as a copy-and-merge in case you want the result in a different register. But if you want that, you'd just or into the temp reg instead of ECX, and you can choose any temp reg you want, including EAX (in which case you could optimize rorx to shr.) add would be useful in case you want to branch on FLAGS from it, since it can macro-fuse with some forms of jcc on Sandybridge-family. xor would also work but has no advantages, and isn't idiomatic for human readers.
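
    For example, to get the result in the temp register while leaving EAX intact (a sketch; note ECX's bit 3 still gets cleared):

       rorx  edx, eax, 23-3     ; EAX still usable afterwards
       and   edx, 1<<3          ; isolate the bit
       and   ecx, 0xFFFFFFF7    ; clear bit 3 in the merge source
       or    edx, ecx           ; result in EDX; add or lea would work equally, no set bits overlap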

    lea can be useful to allow just a shr instead of a longer rorx if you don't mind destroying the input. 2-register LEA is 1 byte longer than add or or, but is fast on most CPUs when there's no shift count (AMD) and also no constant displacement. (It can't run on as many ports on Intel before Ice Lake, so use add or or if all else is equal, i.e. when you can't save any insns or latency or whatever with lea.) clang makes good use of it with -O3 (just tune=generic, no -march): https://godbolt.org/z/Kbbh6zs4W ; it also generates the same SHR/AND/AND/LEA with -march=haswell. (I'd guess it wouldn't think of using rorx to preserve that input even when compiling code that did need it later.)
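
    That SHR/AND/AND/LEA shape looks like this (a sketch of what clang generates at the Godbolt link, with the result in a third register):

       shr   eax, 23-3          ; destroys the input
       and   eax, 1<<3          ; isolate the bit
       and   ecx, 0xFFFFFFF7    ; clear bit 3 in the merge source
       lea   edx, [rcx + rax]   ; copy-and-merge: same as OR here since no set bits overlap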

    These are all single-uop instructions on Intel and AMD, and your destination bit-position was low enough that both AND masks can fit in a sign-extended-imm8 so the AND instructions are 3 bytes each. (Instead of 6 for and r/m32, imm32). rorx is 6 bytes, though, with a VEX prefix and imm8. The total size is 14 bytes, or 20 bytes if the destination bit is outside the low 7. (Or low 8 if you use byte operand-size like and dl, 0x80 / ... / or cl, dl which causes partial-register problems on P6-family but is fine elsewhere.)

    (The instructions used in the question are also single-uop, including bt. On AMD CPUs bts and so on are 2 uops, but bt is just 1.)

    With a higher destination bit-number, you could save size using btr ecx, 30 (4 bytes, still 1 uop on Intel) instead of and ecx, ~(1<<30) (6 bytes, or 5 bytes into EAX). But that costs an extra uop on AMD.

    Of course if you care about code-size, you'd mov edx, eax / ror edx, 23-3 (5 bytes total) instead of using rorx (6 bytes). So that's a total of 17 bytes with a high destination bitpos. Or 15 if we can destroy EAX.
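
    Putting that together for a high destination bit (bit 30 as an assumed example), a code-size-oriented sketch:

       mov   edx, eax           ; 2 bytes (drop this and use EAX below if the input is dead: 15 bytes total)
       rol   edx, 30-23         ; 3 bytes: rotate bit 23 up to bit 30
       and   edx, 1<<30         ; 6 bytes: mask too high for a sign-extended imm8
       btr   ecx, 30            ; 4 bytes: clear the destination bit (but 2 uops on AMD)
       or    ecx, edx           ; 2 bytes: merge, for 17 bytes total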

    If the bit positions were runtime-variables, this would be less efficient, needing a variable count shift. (And some subtraction or something to generate shift counts.) A different strategy might be better there.


    Another way to exchange bits between registers is with masked XORs, but that's not more efficient here where we don't want to swap, just go one way. And we can use the inverted mask as an immediate. (Or if in a register, with BMI1 andn.)
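
    For reference, a sketch of the one-way masked-XOR copy, assuming for simplicity that both bits were at the same position (they aren't here, so a rotate would be needed first):

       mov   edx, eax
       xor   edx, ecx           ; EDX = bits where EAX and ECX differ
       and   edx, 1<<3          ; keep only the bit of interest
       xor   ecx, edx           ; flip ECX's bit 3 iff it differs from EAX's, i.e. copy it
        ; an extra xor eax, edx would make this a full swap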


    The main problem is that x86 lacks a bitfield insert. Extract with shift/and is easy enough, although that's still 2 instructions unless you have AMD TBM for the immediate form of bextr, using the XOP encoding. (Bulldozer-family only.) Or use pext if you already have a constant in a register. It would be neat if there was an inverse of bt that deposited CF at the specified position, but there unfortunately isn't.
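
    For example, a sketch of the pext extract (BMI2), with the mask constant ideally set up once outside any loop:

       mov   edx, 1<<23         ; mask selecting the source bit
       pext  edx, eax, edx      ; EDX = bit 23 of EAX, compacted down to bit 0
        ; still needs shift + AND-clear + OR to insert it, so no overall win for a single bit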

    If there was a bitfield insert instruction, either from CF or from the low bit of another register, you wouldn't need to mask.


    Avoiding a temp reg without rcr/rcl - only slightly slower, still 4 uops

    @rcgldr shows an interesting trick using rcr/rcl which is good for code size, but unfortunately slow on modern CPUs where rcl and rcr r32, imm are many uops, e.g. 7 uops on Zen3 with 3c throughput, and 8 uops on Sandybridge-family (including Alder Lake) with 6c throughput. (https://uops.info/ / https://agner.org/optimize/)
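
    The trick goes something like this (a reconstruction consistent with the 10-byte size mentioned below, not quoted verbatim from that answer):

       rcr   ecx, 3+1           ; 3 bytes: rotate ECX's bit 3 out into CF (33-bit rotate through carry)
       bt    eax, 23            ; 4 bytes: replace CF with the source bit
       rcl   ecx, 3+1           ; 3 bytes: rotate back, inserting CF at bit 3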

    That's 10 bytes of machine code and doesn't need a temp register. We can duplicate the functionality with only 12 bytes of machine code, but still only 4 single-uop instructions, assuming a modern-enough x86. The above version will be faster on Intel Haswell and earlier.

    This has a longer critical-path latency from ECX->result (3 cycles), but same for EAX->result (3 cycles), assuming single-uop adc and rotates. Also more of the uops compete for the shift units, so can't run on as wide a choice of back-end ports. Whether this matters depends on the surrounding code.

    It's the same code size even for a high destination bit-position (where the first version's masks would need imm32), and for 64-bit operand-size it avoids needing any 8-byte mask constants.

    If you don't have a spare register you can clobber (including the input), this is very likely worth it. Or just for code-size, if this isn't a really hot part of your code.

    ;; 12 bytes total.  More latency through ECX, and some uops have fewer ports to choose from
       ror   ecx, 3+1         ; 1 uop on Intel HSW and later, and AMD
        ; the bit to be replaced is now at the top of ECX, and in CF
       bt    eax, 23          ; 1 uop
       adc   ecx, ecx         ; 1 uop on Broadwell and later, and AMD.
        ; Shift in the new bit, shifting out the old bit (into CF in case you care)
       rol   ecx, 3           ; 1 uop on HSW and later, and AMD
        ; restore ECX's bits to the right positions, overwriting CF
    

    The initial rotate-right can be rcr or ror; we don't care whether the bit we want to replace is shifted into the top bit temporarily, or only into CF. ror is much faster.

    We basically emulate rcl ecx, 3+1 with rcl ecx, 1 and then rol ecx, 3. It differs in FLAGS output I think, but matches in ECX result and how it reads from FLAGS.

    And then replace rcl r32, 1 with the equivalent but faster adc same,same; they differ only in the FLAGS output. adc doesn't have any weird partial-flags writing (leaving most of SPAZO unaffected) that makes rotates more expensive on Intel. adc was 2 uops on Intel before Broadwell, but has been 1 uop on AMD for a long time.
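
    The register result is identical either way; only the FLAGS outputs differ:

       rcl   ecx, 1             ; ECX = (ECX << 1) | CF; writes only CF and OF
        ; vs.
       adc   ecx, ecx           ; ECX + ECX + CF = (ECX << 1) | CF; writes all the arithmetic flags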

    This goes through FLAGS using bt so it can easily support a runtime-variable source bit position. For a variable destination bit-position, you'd have to calculate shift counts, and ror reg, cl is slower (3 uops on Intel). Unfortunately there's no variable-count rorx, only shlx / shrx.
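
    A sketch of what a runtime-variable version might look like with BMI1/BMI2 (the register choices are assumptions: source bit-index in ESI, destination bit-index in EDI):

       shrx  edx, eax, esi      ; source bit down to bit 0 (shrx doesn't write FLAGS)
       and   edx, 1             ; isolate it
       shlx  edx, edx, edi      ; move it up to the destination position
       mov   r8d, 1
       shlx  r8d, r8d, edi      ; build the destination mask
       andn  ecx, r8d, ecx      ; BMI1: ECX = ~R8D & ECX, clearing the destination bit
       or    ecx, edx           ; merge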