When I play around with different compilers on https://godbolt.org, I notice that it's very common for compilers to generate code like this:
push rax
push rbx
push rcx
call rdx
pop rcx
pop rbx
pop rax
I understand that each push or pop does two things: it moves a value to or from stack memory, and it subtracts from or adds to the stack pointer. So in our example above, I assume the CPU is actually doing 12 operations (6 moves, 6 adds/subs), not including the call. Wouldn't it be more efficient to combine the adds/subs? For example:
sub rsp, 24
mov [rsp], rax
mov [rsp+8], rbx
mov [rsp+16], rcx
call rdx
mov rcx, [rsp+16]
mov rbx, [rsp+8]
mov rax, [rsp]
add rsp, 24
Now there are only 8 operations (6 moves, 2 adds/subs), not including the call. Why do compilers not use this approach?
If you compile with -mtune=pentium3 or something earlier than -mtune=pentium-m, GCC will do code-gen like you imagined, because on those old CPUs push/pop really does decode to a separate ALU operation on the stack pointer as well as a load/store. (You'll have to use -m32, or -march=nocona (64-bit P4 Prescott), because those old CPUs also don't support x86-64.) See also Why does gcc use movl instead of push to pass function args?
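As a rough illustration (hypothetical function foo; the exact output depends on GCC version and options), passing a few stack args in 32-bit mode might look like this under the two tunings:

    ; gcc -O2 -m32 -mtune=pentium3: avoid push, adjust ESP once (sketch, not exact compiler output)
    sub  esp, 12
    mov  DWORD PTR [esp], 1
    mov  DWORD PTR [esp+4], 2
    mov  DWORD PTR [esp+8], 3
    call foo
    add  esp, 12

    ; gcc -O2 -m32 with a modern default tuning: push is preferred
    push 3
    push 2
    push 1
    call foo
    add  esp, 12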
But Pentium-M introduced a "stack engine" in the front-end that eliminates the stack-adjustment part of stack ops like push/call/ret/pop. It effectively renames the stack pointer with zero latency. See Agner Fog's microarch guide and What is the stack engine in the Sandybridge microarchitecture?
As a general trend, any instruction that's in widespread use in existing binaries will motivate CPU designers to make it fast. For example, Pentium 4 tried to get everyone to stop using INC/DEC; that didn't work; current CPUs do partial-flag renaming better than ever. Modern x86 transistor and power budgets can support that kind of complexity, at least for the big-core CPUs (not Atom / Silvermont). Unfortunately I don't think there's any hope in sight for the false dependencies (on the destination) for instructions like sqrtss or cvtsi2ss, though.
Using the stack pointer explicitly in an instruction like add rsp, 8 requires the stack engine in Intel CPUs to insert a sync uop to update the out-of-order back-end's value of the register. Same if the internal offset gets too large.
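For example (a hypothetical sequence, just to show where the sync uop would appear):

    push rax        ; RSP update tracked as an offset by the front-end stack engine
    push rbx        ; still no back-end ALU uop for RSP
    mov  rdx, rsp   ; explicit read of RSP: a stack-sync uop is inserted first to
                    ; bring the back-end's value of RSP up to date
    add  rsp, 16    ; explicit RSP arithmetic likewise needs the synced value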
In fact pop dummy_register is more efficient than add rsp, 8 or add esp, 4 on modern CPUs, so compilers will typically use that to pop one stack slot with the default tuning, or with -march=sandybridge for example. See Why does this function push RAX to the stack as the first operation?
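For instance, to drop one 8-byte stack slot before returning, you'll often see something like this (sketch):

    pop  rcx        ; discard one qword: a single uop, RSP handled by the stack engine
    ret
    ; instead of:
    add  rsp, 8     ; may cost an extra stack-sync uop if the stack engine's
    ret             ; internal offset is non-zero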
See also What C/C++ compiler can use push pop instructions for creating local variables, instead of just increasing esp once? re: using push to initialize local variables on the stack instead of sub rsp, n / mov. That could be a win in some cases, especially for code-size with small values, but compilers don't do it.
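As a sketch of that idea (not something GCC/clang actually emit), reserving and zero-initializing two 8-byte locals could look like:

    push 0              ; reserve + initialize one qword in a 2-byte instruction
    push 0
    ; versus the usual code-gen:
    sub  rsp, 16
    mov  QWORD PTR [rsp], 0
    mov  QWORD PTR [rsp+8], 0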
Also, no, GCC / clang won't make code that's exactly like what you show.
If they need to save registers around a function call, they will typically do that using mov to memory. Or mov to a call-preserved register that they saved at the top of the function, and will restore at the end.
I've never seen GCC or clang push multiple call-clobbered registers before a function call, other than to pass stack args. And definitely not multiple pops afterwards to restore into the same (or different) registers. Spill/reload inside a function typically uses mov. This avoids the possibility of push/pop inside a loop (except for passing stack args to a call), and allows the compiler to do branching without having to worry about matching pushes with pops. It also reduces the complexity of stack-unwind metadata, which has to have an entry for every instruction that moves RSP. (There's an interesting tradeoff between instruction count vs. metadata and code size for using RBP as a traditional frame pointer.)
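A rough sketch of the usual pattern for keeping a value live across a call (hypothetical function and register choice):

    push rbx            ; save a call-preserved register once, in the prologue
    mov  rbx, rdi       ; stash the value that's needed after the call
    call some_function  ; RBX survives the call per the calling convention
    mov  rdi, rbx       ; the value is still there, no reload from memory
    ...                 ; rest of the function
    pop  rbx            ; restore once, in the epilogue
    ret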
Something like your code-gen could be seen with call-preserved registers + some reg-reg moves in a tiny function that just called another function and then returned an __int128 that was a function arg in registers. So the incoming RSI:RDI would need to be saved, to return in RDX:RAX.
Or if you store to a global or through a pointer after a non-inline function call, the compiler would also need to save the function args until after the call.
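For instance (a hypothetical sketch assuming the x86-64 System V convention; real GCC/clang output will differ in detail), a function like __int128 f(__int128 x) { g(); return x; } might come out roughly as:

    f:
        push rbp
        push rbx
        sub  rsp, 8         ; keep the stack 16-byte aligned for the call
        mov  rbx, rdi       ; save the incoming arg (RSI:RDI = high:low halves of x)
        mov  rbp, rsi
        call g
        mov  rax, rbx       ; return value goes back out in RDX:RAX (high:low)
        mov  rdx, rbp
        add  rsp, 8
        pop  rbx
        pop  rbp
        ret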