Tags: performance, visual-studio-2010, visual-c++, x86, stack-frame

Does omitting the frame pointers really have a positive effect on performance and a negative effect on debug-ability?


As was advised a long time ago, I always build my release executables without frame pointers (which is the default if you compile with /Ox).

However, I have now read in the paper http://research.microsoft.com/apps/pubs/default.aspx?id=81176 that frame pointers don't have much of an effect on performance. So optimizing fully (using /Ox) versus optimizing fully while keeping frame pointers (using /Ox /Oy-) doesn't really make a difference in performance.

Microsoft seems to indicate that keeping frame pointers (/Oy-) makes debugging easier, but is this really the case?

I did some experiments and noticed that:

  • in a simple 32-bit test executable (compiled using /Ox /Ob0), the omission of frame pointers does increase performance (by about 10%). But this test executable only performs some function calls, nothing else.
  • in my own application, adding/removing frame pointers doesn't seem to have a big effect. Adding frame pointers seems to make the application about 5% faster, but that could be within the margin of error.

What is the general advice regarding frame pointers?

  • should they be omitted (/Ox) in a release executable because they really have a positive effect on performance?
  • should they be kept (/Ox /Oy-) in a release executable because they improve debuggability (when debugging with a crash-dump file)?

Using Visual Studio 2010.


Solution

  • Phoronix tested the performance downside of -O2 -fno-omit-frame-pointer with x86-64 GCC 12.1 on a Zen 3 laptop CPU for multiple open-source programs, as proposed for Fedora 37. Most of them had performance regressions, a few of them very serious, although the biggest ones are probably some kind of fluke or other interaction. Geometric mean slowdown of 14% (including those possible outliers).


    Short answer: By omitting the frame pointer,

    You need to use the stack pointer to access local variables and arguments. The compiler doesn't mind, but if you are coding in assembler, this makes your life slightly harder. Much harder if you don't use macros.

    You save four bytes (32-bit architecture) of stack space per function call. Unless you are using deep recursion, this isn't a win.

    You save a memory write to a cached memory (the stack) and you (theoretically) save a few clock ticks on function entry/exit, but you can increase the code size. Unless your function is doing very little very often (in which case it should be inlined), this shouldn't be noticeable.

    You free up a general-purpose register. If the compiler can utilize the register, it will produce code that is both substantially smaller and potentially faster. But if most of the CPU time is spent talking to main memory (or even the hard drive), omitting the frame pointer is not going to save you from that.

    The debugger will lose an easy way to generate the stack trace. The debugger might still be able to generate the stack trace from a different source (such as a PDB file).


    Long answer:

    The typical function entry and exit is (16-bit processor):

    PUSH BP   ;push the base pointer (frame pointer)
    MOV BP,SP ;store the stack pointer in the frame pointer
    SUB SP,xx ;allocate space for local variables et al.
    ...
    LEAVE     ;restore the stack pointer and pop the old frame pointer
    RET       ;return from the function
    

    An entry and exit without a frame pointer could look like (32-bit processor):

    SUB ESP,xx ;allocate space for local variables et al.
    ...
    ADD ESP,xx ;de-allocate space for local variables et al.
    RET        ;return from the function.
    

    You will save two instructions, but you also duplicate a literal value, so the code doesn't get shorter (quite the opposite, especially with [esp+xx] addressing modes taking an extra byte vs. [ebp+xx]), but you might have saved a few clock cycles (or not, if it causes a cache miss in the instruction cache). You did save some space on the stack, though.
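To see both forms in real compiler output, a tiny function like the following is enough (an illustrative sketch, not taken from the question). Compiling it once with `cl /O2 /Oy- /FAs` and once with `cl /O2 /Oy /FAs` and comparing the two assembly listings shows the entry/exit sequences above:

```c
/* Small enough that the prologue/epilogue stands out in a listing,
   big enough that the compiler must reserve stack space for a local
   array (the SUB xx above). */
int sum_squares(int n) {
    int squares[16];            /* forces a stack frame of known size */
    int i, total = 0;
    if (n > 16) n = 16;
    for (i = 0; i < n; ++i)
        squares[i] = i * i;     /* addressed via [ebp-xx] or [esp+xx] */
    for (i = 0; i < n; ++i)
        total += squares[i];
    return total;
}
```

In the /Oy- listing the array is addressed relative to EBP; in the /Oy listing it is addressed relative to ESP, and EBP is free for other uses.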


    You do free up a general-purpose register. This has only benefits.

    In regcall/fastcall, this is one extra register where you can store arguments to your function. Thus, if your function takes seven (on x86; more on most other architectures) or more arguments (including this), the seventh argument still fits into a register. (Although most calling conventions don't pass that many in registers, e.g., two for MS fastcall, three for GCC regparm(3) on 32-bit x86. Up to six integer register arguments on x86-64 System V, or 4 register arguments on most RISC processors.)

    The same, more importantly, applies to local variables as well. Arrays and large objects don't fit into registers (but pointers to them do), but if your function is using seven different local variables (including temporary variables needed to calculate complex expressions), chances are the compiler will be able to produce smaller code. Smaller code means lower instruction cache footprint, which means reduced miss rate and thus even less memory access (but Intel Atom has a 32K instruction cache, meaning that your code will probably fit anyway).

    The x86 architecture features the [BX/BP/SI/DI] and [BX/BP + SI/DI] addressing modes. This makes the BP register an extremely useful place for a scaled array index, especially if the array pointer resides in the SI or DI registers. Two offset registers are better than one.

    Utilising a register avoids memory access, but if a variable is worth storing in a register, chances are it will survive just as fine in an L1 cache (especially since it's going to be on the stack). There is still the cost of moving to/from the cache, but since modern CPUs do a lot of move optimisation and parallelisation, it is possible that an L1 access would be just as fast as a register access. Thus, the speed benefit from not moving data around is still present, but not as enormous. I can easily imagine the CPU avoiding the data cache completely, at least as far as reading is concerned (and writing to cache can be done in parallel).

    A register that is utilised is a register that needs preserving. It is not worth storing much in the registers if you are going to push it to the stack anyway before you use it again. In preserve-by-caller calling conventions (such as the one above), this means that registers as persistent storage are not as useful in a function that calls other functions a lot.

    See What are callee and caller saved registers? for more about how calling conventions are designed with a mix of call-clobbered and call-preserved registers to give compilers a good mix of each, so functions have some scratch registers for temporaries that don't need to live across function calls, but also some registers that callees will preserve. Also Why make some registers caller-saved and others callee-saved? Why not make the caller save everything it wants saved?

    Also note that x86 has a separate register space for floating point registers, meaning that floats cannot utilise the BP register without extra data movement instructions anyway. Only integers and memory pointers do.


    You do lose debuggability by omitting frame pointers. This answer shows why:

    If the code crashes, all the debugger needs to do to generate the stack trace is:

        PUSH BP      ; log the current frame pointer as well
    $1: CALL log_BP  ; log the frame pointer currently on stack
        LEAVE        ; pop the frame pointer to get the next one
        CMP [BP+4],0
        JNZ $1       ; until the stack cannot be popped (the return address is some specific value)
    

    If the code crashes without a frame pointer, the debugger might not have any way to generate the stack trace, because it might not know how much needs to be subtracted from the stack pointer (to find that out, it would need to locate the function's entry/exit points). If the debugger doesn't know that the frame pointer is not being used, it might even crash itself.

    Modern debug-info formats have metadata that still allows stack backtraces in optimized code where the compiler defaults to not using [E/R]BP as a frame pointer. Compilers know how to use assembler directives to create this extra metadata, or write it directly into the object file, not in the parts that normally get mapped into memory. If you don't do this for hand-written assembly, then debuggability would suffer, especially for crashes in functions called by a hand-written assembly function.