GCC and Clang both compile
bool pred();
void f();
void g();
void h() {
if (pred()) {
f();
} else {
g();
}
}
to some variation of
# Clang -Os output. -O3 is the same
h():
push rax
call pred()@PLT
test al, al
je .LBB0_2
pop rax
jmp f()@PLT
.LBB0_2:
pop rax
jmp g()@PLT
when optimizing for code size. You can see this on Compiler Explorer.
Is there a reason for emitting a pop
instruction on both branches instead of emitting it only once before je
?
For example popping into a different call-clobbered register to avoid destroying the pred() return value, or using add rsp, 8
(which is maybe not actually faster on modern CPUs in this case since it needs a stack sync uop.)
# hand-written example of what compilers could be doing
h():
push rax # align the stack
call pred()@PLT
pop rcx # clean up the stack, restoring the stack pointer to its initial state
test al, al
je .LBB0_2
jmp f()@PLT # tailcall f
.LBB0_2:
jmp g()@PLT # tailcall g
You're right, it could call
/ pop rcx
/ test al,al
/ ...
In this case both paths just tailcall without any other work. It would be smaller static code size, and not obviously better or worse in any other way. GCC/clang should be doing what you suggest at -O3
as well.
It's not always possible to destroy the stack frame before branching and setting up the args (if any) for a tailcall, so there's a bunch of things a compiler would have to check on to do this safely in the general case. That kind of reason is common for missed optimizations in tiny simple functions: code that looks for them has to be correct for non-tiny functions, and the benefit has to be worth the compile time, and maintenance cost of the code in the GCC / LLVM source.
It does put more branch instructions closer to each other, which can be problematic for branch prediction in some CPUs. (Even unconditional branches need some prediction to avoid stalling the front-end, since code-fetch is pipelined with decode.) Prediction accuracy can be lower when a lot of branches are nearby. pop
is only a 1-byte instruction so probably doesn't make much difference.
This is a missed optimization you can report on https://gcc.gnu.org/bugzilla/ and https://github.com/llvm/llvm-project/issues/ if there aren't already existing duplicates. Relevant search terms might include "tail duplication", "epilogue", "tear down stack frame".
Finding this optimization is a bit like the opposite of shrink-wrap optimization. (Which is when you do the prologue after an early-out branch that might mean it's not needed.) This would be tearing down the stack frame early, while there's still some work left to do, only involving values in call-clobbered registers.
BTW, there's another missed optimization here: x86-64 direct jmp rel32
has the same range as jcc rel32
, so it could be using JCC as a tailcall (GCC bug 82516 has a short reply from Richard Biener which sheds some light on why GCC misses it.)
# Hand-written with both optimizations applied
h:
push rax # Align the stack
call pred()@PLT
pop rcx # Clean up the stack, restoring the stack
# pointer to its initial state
test al, al
je g()@PLT
jmp f()@PLT # Tailcall f
Unless you compile with -fno-plt
, then you get jmp qword ptr [rip + f()@GOTPCREL] # TAILCALL
which has no JCC equivalent, instead of jmp f()@PLT
.
(If f
and g
are present at link time, the linker can relax the indirect call to addr32 jmp f
(with the 67h
addr32 prefix byte just filling space to make it the same length as the indirect jump it's replacing. Just like how it can relax jumps/calls to f@PLT
into just f
if its statically linked, not in a different shared object.)
Stuff like __attribute__((visibility("hidden")))
could tell the compiler f
and g
are internal to the thing it's compiling, either the main executable or a shared library, so it can use direct calls in the first place.