c++assembly gcc x86-64 compiler-optimization

Why do GCC and Clang pop on both branches instead of only once? (Factoring parts of the epilogue out of tail-duplication)

GCC and Clang both compile

bool pred();
void f();
void g();

void h() {
    if (pred()) {
        f();
    } else {
        g();
    }
}

to some variation of

# Clang -Os output.  -O3 is the same
h():
    push    rax
    call    pred()@PLT
    test    al, al
    je      .LBB0_2
    pop     rax
    jmp     f()@PLT
.LBB0_2:
    pop     rax
    jmp     g()@PLT

when optimizing for code size. You can see this on Compiler Explorer.

Is there a reason for emitting a pop instruction on both branches instead of emitting it only once before je?

For example popping into a different call-clobbered register to avoid destroying the pred() return value, or using add rsp, 8 (which is maybe not actually faster on modern CPUs in this case since it needs a stack sync uop.)

# hand-written example of what compilers could be doing
h():
    push    rax             # align the stack
    call    pred()@PLT
    pop     rcx             # clean up the stack, restoring the stack pointer to its initial state
    test    al, al
    je      .LBB0_2
    jmp     f()@PLT        # tailcall f
.LBB0_2:
    jmp     g()@PLT        # tailcall g

Solution

You're right, it could call / pop rcx / test al,al / ...

In this case both paths just tailcall without any other work. It would be smaller static code size, and not obviously better or worse in any other way. GCC/clang should be doing what you suggest at -O3 as well.

It's not always possible to destroy the stack frame before branching and setting up the args (if any) for a tailcall, so there's a bunch of things a compiler would have to check on to do this safely in the general case. That kind of reason is common for missed optimizations in tiny simple functions: code that looks for them has to be correct for non-tiny functions, and the benefit has to be worth the compile time, and maintenance cost of the code in the GCC / LLVM source.

It does put more branch instructions closer to each other, which can be problematic for branch prediction in some CPUs. (Even unconditional branches need some prediction to avoid stalling the front-end, since code-fetch is pipelined with decode.) Prediction accuracy can be lower when a lot of branches are nearby. pop is only a 1-byte instruction so probably doesn't make much difference.

This is a missed optimization you can report on https://gcc.gnu.org/bugzilla/ and https://github.com/llvm/llvm-project/issues/ if there aren't already existing duplicates. Relevant search terms might include "tail duplication", "epilogue", "tear down stack frame".

Finding this optimization is a bit like the opposite of shrink-wrap optimization. (Which is when you do the prologue after an early-out branch that might mean it's not needed.) This would be tearing down the stack frame early, while there's still some work left to do, only involving values in call-clobbered registers.

BTW, there's another missed optimization here: x86-64 direct jmp rel32 has the same range as jcc rel32, so it could be using JCC as a tailcall (GCC bug 82516 has a short reply from Richard Biener which sheds some light on why GCC misses it.)

# Hand-written with both optimizations applied
h:
    push    rax         # Align the stack
    call    pred()@PLT
    pop     rcx         # Clean up the stack, restoring the stack
                        # pointer to its initial state
    test    al, al
    je      g()@PLT
    jmp     f()@PLT     # Tailcall f

Unless you compile with -fno-plt, then you get jmp qword ptr [rip + f()@GOTPCREL] # TAILCALL which has no JCC equivalent, instead of jmp f()@PLT.

(If f and g are present at link time, the linker can relax the indirect call to addr32 jmp f (with the 67h addr32 prefix byte just filling space to make it the same length as the indirect jump it's replacing. Just like how it can relax jumps/calls to f@PLT into just f if its statically linked, not in a different shared object.)

Stuff like __attribute__((visibility("hidden"))) could tell the compiler f and g are internal to the thing it's compiling, either the main executable or a shared library, so it can use direct calls in the first place.