What's faster:
add DWORD PTR [rbp-0x4],1
or
mov eax,DWORD PTR [rbp-0x4]
add eax,1
mov DWORD PTR [rbp-0x4],eax
I've seen a compiler generate the second version, so maybe add
on a register is much faster?
They both decode to the same number of back-end (unfused-domain) uops, but the memory-destination add
gets those uops through the front-end in fewer fused-domain uops on modern Intel/AMD CPUs.
On Intel CPUs, add [mem], imm
decodes to a micro-fused load+add and a micro-fused store-address+store-data, so 2 total fused-domain uops for the front-end. The separate mov/add/mov sequence is 3 fused-domain uops: the store still micro-fuses, but the load and the ALU add are separate instructions. AMD CPUs have always kept a memory operand grouped with its ALU operation; they just don't call it "micro-fusion", it's how their decoders have always worked.
(See https://agner.org/optimize/ and INC instruction vs ADD 1: Does it matter?)
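
To put rough numbers on that, here's how each version might break down on a Sandy Bridge-family Intel core (counts are an assumption based on Agner Fog's tables, not measurements):

    add DWORD PTR [rbp-0x4],1      ; load+add micro-fused          (1 fused-domain uop)
                                   ; store-address+store-data fused (1 fused-domain uop)
                                   ; total: 2 fused-domain uops, 4 back-end uops

    mov eax,DWORD PTR [rbp-0x4]    ; 1 uop (load)
    add eax,1                      ; 1 uop (ALU)
    mov DWORD PTR [rbp-0x4],eax    ; store-address+store-data fused (1 fused-domain uop)
                                   ; total: 3 fused-domain uops, 4 back-end uops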
The first way doesn't leave the value in a register, so you couldn't use it as part of ++a
if the value of the expression was used; it only works for the side-effect on memory.
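
For example, if the result of ++a feeds another expression, the second form already has it in eax; a hypothetical continuation:

    mov eax,DWORD PTR [rbp-0x4]
    add eax,1
    mov DWORD PTR [rbp-0x4],eax
    add edx,eax                    ; reuse the incremented value without reloading it

With add DWORD PTR [rbp-0x4],1 you'd need an extra load to get the new value back into a register.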
Using [rbp - 4]
and incrementing a local in memory smells like un-optimized / debug-mode code, which you should not be looking at when judging what's efficient. Optimized code typically addresses locals relative to rsp (e.g. [rsp + constant]), and (unless the variable is volatile) wouldn't be storing the value right back into memory anyway.
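
When a local genuinely does have to live in memory (say, because its address escapes to another function), optimized code still doesn't need a frame pointer. A sketch of what that might look like, with a hypothetical callee use_ptr (an illustration of -O2-style output, not from a real compiler run):

    takes_address_of_local:
        sub  rsp, 24
        mov  DWORD PTR [rsp+12], 1   ; local lives at [rsp+12], not [rbp-4]
        lea  rdi, [rsp+12]
        call use_ptr                 ; hypothetical callee that receives &local
        mov  eax, DWORD PTR [rsp+12] ; reload: use_ptr may have modified it
        add  rsp, 24
        ret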
Why does clang produce inefficient asm with -O0 (for this simple floating point sum)? - compiling in debug mode, aka -O0
(the default), compiles each C statement separately and treats every variable sort of like volatile, which is totally horrible.
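
That's exactly the pattern in the question: at -O0, something like int x = 0; x++; goes through the stack slot for every statement. A rough sketch of what -O0 tends to emit for it (illustrative, not actual compiler output):

    mov DWORD PTR [rbp-0x4],0      ; int x = 0;   store to the stack slot
    mov eax,DWORD PTR [rbp-0x4]    ; x++;  reload,
    add eax,1                      ;       increment,
    mov DWORD PTR [rbp-0x4],eax    ;       and store straight back

At -O1 or higher, x would stay in a register (or disappear entirely if it's never used).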
See How to remove "noise" from GCC/clang assembly output? for how to get compilers to make asm that's interesting to look at: write a function that takes args and returns a value, so it has to do real work instead of optimizing away or propagating constants into mov eax, constant_result.
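
For instance, a trivial function like this (a hypothetical example) still forces the compiler to show how it actually does the work; at -O2, gcc and clang typically produce something along these lines:

    ; int add_one(int x) { return x + 1; }
    add_one:
        lea  eax, [rdi+1]            ; compute x+1 straight into the return register
        ret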