
Is calling `add` on a memory location faster than calling it on a register and then moving the value?


What's faster:

add DWORD PTR [rbp-0x4],1

or

 mov    eax,DWORD PTR [rbp-0x4]
 add    eax,1
 mov    DWORD PTR [rbp-0x4],eax

I've seen a compiler generate the second sequence, so maybe `add` on a register is much faster?


Solution

  • They both decode to the same number of back-end uops, but the memory-destination add gets those uops through the front-end in fewer fused-domain uops on modern Intel/AMD CPUs.

    On Intel CPUs, add [mem], imm decodes to a micro-fused load+add and a micro-fused store-address+store-data, so 2 total fused-domain uops for the front-end. AMD CPUs have always kept memory operands grouped with the ALU operation; they don't call it "micro-fusion", it's just how they've always worked. (See https://agner.org/optimize/ and INC instruction vs ADD 1: Does it matter?)


    The first way doesn't leave the value in a register, so you couldn't use it as part of ++a if the value of the expression was used; it's only good for the side-effect on memory.


    Using [rbp - 4] and incrementing a local in memory smells like un-optimized / debug-mode code, which you should not be looking at when judging what's efficient. Optimized code typically addresses locals as [rsp + constant], and (unless the variable is volatile) wouldn't be storing the value right back into memory again.

    Why does clang produce inefficient asm with -O0 (for this simple floating point sum)? - compiling in debug mode, aka -O0 (the default), compiles each C statement separately and treats every variable sort of like volatile, which is totally horrible for performance.

    See How to remove "noise" from GCC/clang assembly output? for how to get compilers to make asm that's interesting to look at. Write a function that takes args and returns a value, so the compiler has real work to do instead of optimizing everything away or propagating constants into mov eax, constant_result.