Search code examples
c++assemblyg++

Remove needless assembler statements from g++ output


I am investigating some problem with a local binary. I've noticed that g++ creates a lot of ASM output that seems unnecessary to me. Example with -O0:

Derived::Derived():
    pushq   %rbp
    movq    %rsp, %rbp
    subq    $16, %rsp          <--- just need 8 bytes for the movq to -8(%rbp), why -16?
    movq    %rdi, -8(%rbp)
    movq    -8(%rbp), %rax
    movq    %rax, %rdi         <--- now we have moved rdi onto itself.
    call    Base::Base()
    leaq    16+vtable for Derived(%rip), %rdx
    movq    -8(%rbp), %rax     <--- effectively %edi, does not point into this area of the stack
    movq    %rdx, (%rax)       <--- thus this wont change -8(%rbp)
    movq    -8(%rbp), %rax     <--- so this statement is unnecessary
    movl    $4712, 12(%rax)
    nop
    leave
    ret

option -O1 -fno-inline -fno-elide-constructors -fno-omit-frame-pointer:

Derived::Derived():
    pushq   %rbp
    movq    %rsp, %rbp
    pushq   %rbx
    subq    $8, %rsp       <--- reserve some stack space and never use it.
    movq    %rdi, %rbx
    call    Base::Base()
    leaq    16+vtable for Derived(%rip), %rax
    movq    %rax, (%rbx)
    movl    $4712, 12(%rbx)
    addq    $8, %rsp       <--- release unused stack space.
    popq    %rbx
    popq    %rbp
    ret

This code is for the constructor of Derived that calls the Base base constructor and then overrides the vtable pointer at position 0 and sets a constant value to an int member it holds in addition to what Base contains.

Question:

  • Can I translate my program with as few optimizations as possible and get rid of such stuff? Which options would I have to set? Or is there a reason the compiler cannot detect these cases with -O0 or -O1 and there is no way around them?
  • Why is the subq $8, %rsp statement generated at all? You cannot optimize in or out a statement that makes no sense to begin with. Why does the compiler generate it then? The register allocation algorithm should never, even with O0, generate code for something that is not there. So why it is done?

Solution

  • I don't see any obvious missed optimizations in your -O1 output. Except of course setting up RBP as a frame pointer, but you used -fno-omit-frame-pointer so clearly you know why GCC didn't optimize that away.

    The function has no local variables

    Your function is a non-static class member function, so it has one implicit arg: this in rdi. Which g++ spills to the stack because of -O0. Function args count as local variables.

    How does a cyclic move without an effect improve the debugging experience. Please elaborate.

    To improve C/C++ debugging: debug-info formats can only describe a C variable's location relative to RSP or RBP, not which register it's currently in. Also, so you can modify any variable with a debugger and continue, getting the expected results as if you'd done that in the C++ abstract machine. Every statement is compiled to a separate block of asm with no values alive in registers (Fun fact: except register int foo: that keyword does affect debug-mode code gen).

    Why does clang produce inefficient asm with -O0 (for this simple floating point sum)? applies to G++ and other compilers as well.

    Which options would I have to set?

    If you're reading / debugging the asm, use at least -Og or higher to disable the debug-mode spill-everything-between-statements behaviour of -O0. Preferably -O2 or -O3 unless you like seeing even more missed optimizations than you'd get with full optimization. But -Og or -O1 will do register allocation and make sane loops (with the conditional branch at the bottom), and various simple optimizations. Although still not the standard peephole of xor-zeroing.

    How to remove "noise" from GCC/clang assembly output? explains how to write functions that take args and return a value so you can write functions that don't optimize away.

    Loading into RAX and then movq %rax, %rdi is just a side-effect of -O0. GCC spends so little time optimizing the GIMPLE and/or RTL internal representations of the program logic (before emitting x86 asm) that it doesn't even notice it could have loaded into RDI in the first place. Part of the point of -O0 is to compile quickly, as well as consistent debugging.

    Why is the subq $8, %rsp statement generated at all?

    Because the ABI requires 16-byte stack alignment before a call instruction, and this function did an even number of 8-byte pushes. (call itself pushes a return address). It will go away at -O1 without -fno-omit-frame-pointer because you aren't forcing g++ to push/pop RBP as well as the call-preserved register it actually needs.

    Why does System V / AMD64 ABI mandate a 16 byte stack alignment?

    Fun fact: clang will often use a dummy push %rcx/pop or something, depending on -mtune options, instead of an 8-byte sub.

    If it were a leaf function, g++ would just use the red-zone below RSP for locals, even at -O0. Why is there no "sub rsp" instruction in this function prologue and why are function parameters stored at negative rbp offsets?


    In un-optimized code it's not rare for G++ to allocate an extra 16 bytes it doesn't ever use. Even sometimes with optimization enabled g++ rounds up its stack allocation size too far when aiming for a 16-byte boundary. This is a missed-optimization bug. e.g. Memory allocation and addressing in Assembly