assembly gcc x86-64 addition cpu-architecture

gcc using `lea` instead of `add`

I wanted to remember how a compiler performs integer division by 2. Interestingly, I found a particular behavior that I do not understand:

https://godbolt.org/z/87n3x5Gjv

    int div2(int x)
    {
        return x / 2;
    }

x86_64 gcc 13.2 -O1

    div2(int):
        mov     eax, edi
        shr     eax, 31
        lea     eax, [rax+rdi]
        sar     eax
        ret

I personally would have used add instead of lea. -O0 and -O2 do that, and clang does it at every optimization level. Does this have something to do with the flags that the add instruction might modify? I would love to know why something like this might happen, thanks in advance!

Solution

I think LEA is worse here, probably the result of some heuristic gone wrong. You could report it with the "missed-optimization" keyword on GCC's bugzilla, https://gcc.gnu.org/bugzilla/
I've noticed this too: some compilers seem to prefer LEA for no reason, even when they don't need to put the result in a destination that's different from both inputs.

LEA runs on fewer ports on some CPUs, such as Intel before Ice Lake. At least it's still a "simple" LEA on all modern CPUs: 2 components and no scale factor on the index. Otherwise it could run on even fewer ports, and might have latency higher than 1 cycle. (https://uops.info/) (But doing the work of 2 to 4 instructions with a more complex LEA is usually still worth it, unless it's part of a loop-carried dependency chain.)

LEA does cost extra code size for the SIB byte in the addressing mode. (opcode + ModRM + SIB for the 3-byte `lea`, versus just opcode + ModRM for the 2-byte `add eax, edi`)
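
To make the size difference concrete, here are the machine-code bytes I'd expect an assembler such as GAS to emit for the two instructions, written as C byte arrays. The array names are made up for illustration, and other equally valid encodings exist (e.g. `03 C7` for the `add`), so treat this as a sketch rather than the only possible output:

```c
/* Illustrative encodings; array names are hypothetical. */

/* lea eax, [rax+rdi]:
   0x8D = LEA opcode
   0x04 = ModRM (mod=00, reg=eax, rm=100 meaning "SIB byte follows")
   0x38 = SIB   (scale=1, index=rdi, base=rax) */
static const unsigned char lea_eax_rax_rdi[] = { 0x8D, 0x04, 0x38 };

/* add eax, edi:
   0x01 = ADD r/m32, r32 opcode
   0xF8 = ModRM (mod=11 register-direct, reg=edi, rm=eax) */
static const unsigned char add_eax_edi[] = { 0x01, 0xF8 };
```

`sizeof lea_eax_rax_rdi` is 3 and `sizeof add_eax_edi` is 2, which is the one-byte SIB cost described above.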

Not writing FLAGS isn't useful here. x86 CPUs handle FLAGS writes fully efficiently, even using the same physical-register-file entry for both the FLAGS and the integer result. (Sandybridge-family definitely does this; I assume Zen and others do as well, instead of needing another whole register file and register-allocation-table or something, or needing uops to have 2 outputs.) Some GCC devs might be unsure about this: I seem to recall a comment on a GCC bugzilla issue suggesting that not modifying FLAGS would have some advantage, like less work for the CPU not to have to rename it. That's not the case.

When you're already writing a register destination, writing FLAGS as well has zero extra cost. It might even allow freeing up an older physical register that was holding just a FLAGS result but not also the current value of any integer reg.