Tags: assembly, x86, x86-64, cpu-architecture, micro-optimization

Why use push/pop instead of sub and mov?


When I play around with different compilers on https://godbolt.org, I notice that it's very common for compilers to generate code like this:

push    rax
push    rbx
push    rcx
call    rdx
pop     rcx
pop     rbx
pop     rax

I understand that each push or pop does two things:

  1. move the operand to/from the stack space
  2. increment/decrement the stack pointer (rsp)

So in our example above, I assume the CPU is actually doing 12 operations (6 moves, 6 adds/subs), not including the call. Wouldn't it be more efficient to combine the adds/subs? For example:

sub     rsp, 24
mov     [rsp], rax
mov     [rsp + 8], rbx
mov     [rsp + 16], rcx
call    rdx
mov     rcx, [rsp + 16]
mov     rbx, [rsp + 8]
mov     rax, [rsp]
add     rsp, 24

Now there are only 8 operations (6 moves, 2 adds/subs), not including the call. Why do compilers not use this approach?


Solution

  • If you compile with -mtune=pentium3 or something earlier than -mtune=pentium-m, GCC will do code-gen like you imagined, because on those old CPUs push/pop really does decode to a separate ALU operation on the stack pointer as well as a load/store. (You'll have to use -m32, or -march=nocona (64-bit P4 Prescott), because CPUs that old don't support x86-64.) See Why does gcc use movl instead of push to pass function args?
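    For example, 32-bit code-gen for a call like foo(1, 2, 3) would look roughly like this under the two tunings (a sketch of the shape, not verified GCC output; foo is a hypothetical function):

    ; -mtune=pentium3 style: one ESP adjustment, then plain mov stores
    sub     esp, 12
    mov     DWORD PTR [esp+8], 3
    mov     DWORD PTR [esp+4], 2
    mov     DWORD PTR [esp], 1
    call    foo
    add     esp, 12

    ; newer tunings: push, letting the stack engine handle ESP
    push    3
    push    2
    push    1
    call    foo
    add     esp, 12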

    But Pentium-M introduced a "stack engine" in the front-end that eliminates the stack-adjustment part of stack ops like push/call/ret/pop. It effectively renames the stack pointer with zero latency. See Agner Fog's microarch guide and What is the stack engine in the Sandybridge microarchitecture?
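    As an informal sketch of what that buys (not an actual uop trace; details vary by microarchitecture):

    push    rax        ; 1 store uop; the decoders track an rsp offset of -8
    push    rbx        ; 1 store uop; tracked offset now -16, still no ALU uop
    pop     rbx        ; 1 load uop; tracked offset back to -8
    pop     rax        ; 1 load uop; offset back to 0, with no add/sub ever reaching the back-end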

    As a general trend, any instruction that's in widespread use in existing binaries will motivate CPU designers to make it fast. For example, Pentium 4 tried to get everyone to stop using INC/DEC; that didn't work; current CPUs do partial-flag renaming better than ever. Modern x86 transistor and power budgets can support that kind of complexity, at least for the big-core CPUs (not Atom / Silvermont). Unfortunately I don't think there's any hope in sight for the false dependencies (on the destination) for instructions like sqrtss or cvtsi2ss, though.


    Using the stack pointer explicitly in an instruction like add rsp, 8 requires the stack engine in Intel CPUs to insert a stack-sync uop to update the out-of-order back-end's value of the register. The same happens if the internal offset gets too large.
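    For example (again a sketch; exact uop counts vary by microarchitecture):

    push    rdi            ; rsp delta tracked by the stack engine
    push    rsi            ; still front-end only
    mov     rax, [rsp+8]   ; explicit rsp reference: a stack-sync uop is inserted first
    add     rsp, 16        ; delta already synced; just a back-end ALU uop for the add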

    In fact pop dummy_register is more efficient than add rsp, 8 or add esp, 4 on modern CPUs, so compilers will typically use that to pop one stack slot with the default tuning, or with -march=sandybridge for example. See Why does this function push RAX to the stack as the first operation?
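    For example, to remove one 8-byte slot:

    add     rsp, 8     ; back-end ALU uop (plus maybe a stack-sync uop first)
    pop     rcx        ; just a load uop (result ignored); the stack engine handles rsp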

    See also What C/C++ compiler can use push pop instructions for creating local variables, instead of just increasing esp once? re: using push to initialize local variables on the stack instead of sub rsp, n / mov. That could be a win in some cases, especially for code-size with small values, but compilers don't do it.
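    A sketch of what that could look like for two qword locals with small constant initializers (hand-written; not something compilers actually emit):

    ; what compilers do:
    sub     rsp, 16
    mov     QWORD PTR [rsp+8], 2
    mov     QWORD PTR [rsp], 1

    ; what could be smaller: push imm8 is only 2 bytes per instruction
    push    2
    push    1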


    Also, no: GCC / clang won't emit code exactly like what you show.

    If they need to save registers around a function call, they typically do that with mov to memory, or with mov to a call-preserved register that they saved at the top of the function and will restore at the end.

    I've never seen GCC or clang push multiple call-clobbered registers before a function call, other than to pass stack args, and definitely not multiple pops afterwards to restore into the same (or different) registers. Spill/reload inside a function typically uses mov. That avoids push/pop inside loops (except for passing stack args to a call), and lets the compiler branch without having to worry about matching pushes with pops. It also reduces the complexity of stack-unwind metadata, which needs an entry for every instruction that moves RSP. (There's an interesting tradeoff between instruction count vs. metadata and code size when using RBP as a traditional frame pointer.)
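    A typical spill/reload around a call looks more like this (illustrative only; the register and offset choices are arbitrary, and some_function is a hypothetical callee):

    sub     rsp, 24           ; reserve stack space once, in the prologue
    mov     [rsp+8], rdi      ; spill a value that must survive the call
    call    some_function
    mov     rdi, [rsp+8]      ; reload with mov: rsp doesn't move, so there's
                              ; no unwind-metadata entry and no pop to match
    add     rsp, 24           ; release once, in the epilogue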

    Something like your code-gen could be seen with call-preserved registers plus some reg-reg moves, in a tiny function that just calls another function and then returns an __int128 that arrived in registers: the incoming RSI:RDI would need to be saved across the call, to be returned in RDX:RAX.
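    For instance, something like this (a hand-written sketch of plausible code-gen, not verified compiler output):

    ; __int128 f(__int128 a, void (*fp)(void)) { fp(); return a; }
    f:
    push    rbp
    push    rbx
    sub     rsp, 8        ; re-align the stack for the call
    mov     rbx, rdi      ; save the low half of a
    mov     rbp, rsi      ; save the high half of a
    call    rdx           ; fp arrives in rdx
    mov     rax, rbx      ; return the low half in rax
    mov     rdx, rbp      ; and the high half in rdx
    add     rsp, 8
    pop     rbx
    pop     rbp
    ret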

    Or if you store to a global or through a pointer after a non-inline function call, the compiler would also need to save the function args until after the call.
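    e.g. for something like void f(long x) { ext(); glob = x; } (hypothetical names), you'd expect roughly:

    f:
    push    rbx                        ; save caller's rbx (also re-aligns the stack)
    mov     rbx, rdi                   ; x arrived in a call-clobbered register
    call    ext
    mov     QWORD PTR glob[rip], rbx   ; the store has to wait until after the call
    pop     rbx
    ret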