Search code examples
assemblyx86micro-optimization

How to copy a register and do `x*4 + constant` with the minimum number of instructions


I am new to x86 assembly. For example for the following instruction: multiply the contents of ESP by 4 and add 0x11233344, storing the result in EDI.

How do I represent this instruction in x86 assembly with minimum number of instructions?

push esp
mov edi, 4
mul edi
add edi, 0x11233344

Solution

  • Your asm doesn't make any sense (push esp copies to memory, not another register), and mul edi writes EDX:EAX not edi. It does EDX:EAX = EAX * src_operand. Read the manual: https://www.felixcloutier.com/x86/MUL.html. Or better, use imul instead unless you actually need the high-half output of the full 32x32 => 64-bit multiply.

    Also, don't use the stack pointer register ESP to hold temporary values unless you know exactly what you're doing (e.g. you're in user-space, and you've made sure no signal handlers can asynchronously use the stack.) stack-pointer * 4 + large-constant is not something that a normal program would ever do.


    Normally you could do this in one LEA instruction but ESP is the only register that can't be an index in an x86 address mode. See rbp not allowed as SIB base? (The index is the part of an addressing mode that can have a 2-bit shift count applied, aka a scale factor).

    I think our best bet is still just to copy ESP to EDI, then use LEA:

     mov  edi, esp
     lea  edi, [edi * 4 + 0x11223344]
    

    Or you could copy-and-add with LEA, and then left shift, because the value we're adding has two zeros as its low bits (i.e. it's a multiple of 4). So we can right shift it by 2 without losing any bits.

    SHIFTED_ADD_CONSTANT equ 0x11223344 >> 2
    
      lea    edi, [esp + SHIFTED_ADD_CONSTANT]
      shl    edi, 2
    

    The add before left-shifting will produce carry into the top 2 bits, but we're about to shift those bits out so it doesn't matter what's there.

    This is also 2 uops, and more efficient on AMD Bulldozer-family CPUs (no mov-elimination for GP-integer mov, and where a scaled index costs an extra cycle of latency for LEA). Zen has mov-elimination but I think still the same LEA latencies so both versions are 2 cycle latency. Even "complex" LEA has 2/clock throughput on Zen, or 4/clock for simple LEA (any ALU port).

    But less efficient on Intel IvyBridge and later CPUs where the mov can run with zero latency (mov elimination), and the [edi*4 + disp32] addressing mode is still a fast 2-component LEA. So on Intel CPUs with mov-elimination, the first version is 2 front-end uops, 1 unfused-domain uop for an execution unit, and only 1 cycle of latency.

    Another 2-instruction option is to use a slower imul instead of a fast shift. (Addressing modes use a shift: even though it's written as * 1 / 2 / 4 / 8, it's encoded in a 2-bit shift-count field in machine code).

      imul  edi, esp, 4       ; this is dumb, don't use mul/imul for powers of 2.
      add   edi, 0x11223344
    

    imul has 3 cycle latency on modern x86 CPUs which is pretty good, but is slower on old CPUs like Pentium 3. Still not as good as 1 or 2-cycle latency for mov + LEA, and imul runs on fewer ports.


    (Number of instructions is not usually the thing to optimize for; number of uops usually matters more, and latency / back-end throughput. Also code-size in bytes of x86 machine code; different instructions are different lengths.)