
Why are Rust stack frames so big?


I encountered an unexpectedly early stack overflow and created the following program to test the issue:

#![feature(asm)]
#[inline(never)]
fn get_rsp() -> usize {
    let rsp: usize;
    unsafe {
        asm! {
            "mov {}, rsp",
            out(reg) rsp
        }
    }
    rsp
}

fn useless_function(x: usize) {
    if x > 0 {
        println!("{:x}", get_rsp());
        useless_function(x - 1);
    }
}

fn main() {
    useless_function(10);
}
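
The listing above relies on the nightly-only asm feature gate of that era. As a hedged sketch for newer toolchains, assuming an x86-64 target and Rust 1.59 or later (where asm! is stable and imported from std::arch), the same helper could be written roughly as follows. The options(...) clause is an addition that is not in the original program.

```rust
use std::arch::asm;

// Assumption: x86-64 target, Rust 1.59+ (stable `asm!`).
#[inline(never)]
fn get_rsp() -> usize {
    let rsp: usize;
    unsafe {
        // Copy the stack pointer into a compiler-chosen general-purpose register.
        // `nomem, nostack`: the instruction reads no memory and pushes nothing.
        asm!("mov {}, rsp", out(reg) rsp, options(nomem, nostack));
    }
    rsp
}
```

The rest of the program (useless_function and main) would stay unchanged.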

This is get_rsp disassembled (according to cargo-asm):

tests::get_rsp:
 push    rax
 #APP
 mov     rax, rsp
 #NO_APP
 pop     rcx
 ret

I'm not sure what #APP and #NO_APP do or why rax is pushed and then popped into rcx, but it seems the function does return the stack pointer.

I was surprised to find that in debug mode, the difference between two consecutively printed rsp values was 192 bytes(!), and even in release mode it was 128 bytes. As far as I understand, all that needs to be stored for each call to useless_function is one usize and a return address, so I'd expect every stack frame to be around 16 bytes.

I'm running this with rustc 1.46.0 on a 64-bit Windows machine.

Are my results consistent across machines? How is this explained?


It seems that the use of println! has a pretty significant effect. To avoid that, I changed the program (thanks to @Shepmaster for the idea) to store the values in a static array:

static mut RSPS: [usize; 10] = [0; 10];

#[inline(never)]
fn useless_function(x: usize) {
    unsafe { RSPS[x] = get_rsp() };
    if x == 0 {
        return;
    }
    useless_function(x - 1);
}

fn main() {
    useless_function(9);
    println!("{:?}", unsafe { RSPS });
}

The recursion gets optimised away in release mode, but in debug mode each frame still takes 80 bytes which is way more than I anticipated. Is this just the way stack frames work on x86? Do other languages do better? This seems a little inefficient.
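
A per-frame figure like the 80 bytes above is the difference between neighbouring recorded stack pointers. As a minimal sketch, a hypothetical helper (frame_sizes is not part of the original program) could compute those differences from the recorded values:

```rust
/// Distance in bytes between each pair of neighbouring recorded stack pointers.
/// Assumption: the slice is indexed by recursion argument, as RSPS is above,
/// so adjacent entries belong to adjacent calls.
fn frame_sizes(rsps: &[usize]) -> Vec<usize> {
    rsps.windows(2)
        // abs_diff avoids an underflow panic whichever direction the stack grows.
        .map(|pair| pair[1].abs_diff(pair[0]))
        .collect()
}
```

Passing a copy of RSPS to this helper in main would print the frame sizes directly instead of raw addresses.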


Solution

  • This answer shows how this works in asm for an un-optimized C++ version.

    This might not tell us as much as I thought about Rust; apparently Rust uses its own ABI / calling convention so it won't have "shadow space" making its stack frames bulkier on Windows. The first version of my answer guessed that it would follow the Windows calling convention for calls to other Rust functions, when targeting Windows. I've adjusted the wording, but I didn't delete it even though it's potentially not relevant to Rust.

    After further research: at least in 2016, Rust's ABI happened to match the platform calling convention on Windows x64, at least if the disassembly of the debug-build binary in this random tutorial is representative of anything. heap::allocate::h80a36d45ddaa4ae3Lca in the disassembly clearly takes args in RCX and RDX (spilling and reloading them to the stack), then calls another function with those args, leaving 0x20 bytes of space unused above RSP before the call, i.e. shadow space.

    If nothing has changed since 2016 (easily possible), I think this answer does reflect some of what Rust does when compiling for Windows.


    The recursion gets optimised away in release mode, but in debug mode each frame still takes 80 bytes which is way more than I anticipated. Is this just the way stack frames work on x86? Do other languages do better?

    Yes, C and C++ do better: 48 or 64 bytes per stack frame on Windows, 32 on Linux.

    The Windows x64 calling convention requires a caller to reserve 32 bytes of shadow space (basically unused stack-arg space above the return address) for use by the callee. But it looks like un-optimized clang builds may not take advantage of that shadow space, allocating extra space to spill local vars.

    Also, the return address takes 8 bytes, and re-aligning the stack by 16 before another call takes another 8 bytes, so the minimum you can hope for is 48 bytes on Windows (unless you enable optimization, then as you say, tail-recursion easily optimizes into a loop). GCC compiling a C or C++ version of that code does achieve that.

    Compiling for Linux, or any other x86-64 target that uses the x86-64 System V ABI, gcc and clang manage 32 bytes per frame for a C or C++ version. Just ret addr, saved RBP, and another 16 bytes to keep alignment while making room to spill 8-byte x. (Compiling as C or as C++ makes no difference to the asm).
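
    The arithmetic above can be sketched as a small Rust helper. This is a simplified model, not how any compiler actually lays out its frames: min_frame_bytes and its parameters are hypothetical names, and the only rule it assumes is that a frame (return address included) must be a multiple of 16 bytes.

```rust
const RETURN_ADDRESS: usize = 8; // pushed by `call`
const ALIGNMENT: usize = 16; // RSP must be 16-byte aligned before the next call

/// Smallest per-call stack usage, in bytes, under the simplified model:
/// sum everything the frame must hold, then round up to the alignment.
fn min_frame_bytes(saved_regs: usize, locals: usize, shadow: usize) -> usize {
    let raw = RETURN_ADDRESS + saved_regs + locals + shadow;
    (raw + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT
}
```

    Under these assumptions, min_frame_bytes(0, 0, 32) gives 48 (the Windows minimum: shadow space plus return address, padded), and min_frame_bytes(8, 8, 0) gives 32 (System V: saved RBP plus the spilled x).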


    I tried GCC and clang on an un-optimized C++ version using the Windows calling convention on the Godbolt compiler explorer. To just look at the asm for useless_function, there was no need to write a main or get_rsp.

    #include <stdlib.h>
    
    #define MS_ABI __attribute__((ms_abi))   // for GNU C compilers.  Godbolt link has an ifdeffed version of this
    
    void * RSPS[10] = {0};
    
    MS_ABI void *get_rsp(void);
    MS_ABI void useless_function(size_t x) {
        RSPS[x] = get_rsp();
        if (x == 0) {
            return;
        }
        useless_function(x - 1);
    }
    

    clang/LLVM un-optimized does push rbp / sub rsp, 48, for a total of 64 bytes per frame (including the return address). GCC does push rbp / sub rsp, 32, for a total of only 48 bytes per frame, as predicted.

    So apparently un-optimized LLVM does allocate "unneeded" space because it fails to use the shadow space allocated by the caller. If Rust used shadow space, this might explain some of why your debug-mode Rust version uses more stack space than we might expect, even with printing done outside the recursive function. (Printing uses a lot of space for locals.)

    But part of that explanation must also include having some locals that take more space, e.g. perhaps for pointer locals or bounds checks? C and C++ map pretty directly to asm, with access to globals not needing any extra stack space. (Or even extra registers, when the global array can be assumed to be in the low 2GiB of virtual address space, so its address is usable as a 32-bit signed displacement in combination with other registers.)

    # clang 10.0.1 -O0, for Windows x64
    useless_function(unsigned long):
            push    rbp
            mov     rbp, rsp                  # set up a legacy frame pointer.
            sub     rsp, 48                   # reserve enough for shadow space (32) + 16, maintaining stack alignment.
            mov     qword ptr [rbp - 8], rcx   # spill incoming arg to newly reserved space above the shadow space
            call    get_rsp()
    ...
    

    The only stack space used for locals is for x; there are no invented temporaries as part of the array access. It's just a reload of x, then mov qword ptr [8*rcx + RSPS], rax to store the function call return value.

    # GCC10.2 -O0, for Windows x64
    useless_function(unsigned long):
            push    rbp
            mov     rbp, rsp
            sub     rsp, 32                   # just reserve enough for shadow space for callee
            mov     QWORD PTR [rbp+16], rcx   # spill incoming arg to our own shadow space
            call    get_rsp()
    ...
    

    Without the ms_abi attribute, both GCC and clang use sub rsp, 16.
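
    To make the two spill locations in the Windows listings concrete (clang stores x at [rbp - 8], GCC at [rbp+16]), here is a rough Rust sketch that labels an RBP-relative offset. It assumes the push rbp / mov rbp, rsp prologue shown in both listings and the 32-byte shadow space of the Windows x64 convention; Slot and classify are hypothetical names used only for illustration.

```rust
#[derive(Debug, PartialEq)]
enum Slot {
    /// Below RBP: space this function reserved itself with `sub rsp, N`
    /// (its locals, plus the shadow space it provides to its own callee).
    OwnFrame,
    /// [rbp, rbp + 8): the caller's RBP, saved by `push rbp`.
    SavedRbp,
    /// [rbp + 8, rbp + 16): pushed by the `call` instruction.
    ReturnAddress,
    /// [rbp + 16, rbp + 48): shadow space the caller reserved for this function.
    ShadowSpace,
    /// Anything higher belongs to the caller's frame.
    CallerFrame,
}

fn classify(offset_from_rbp: i64) -> Slot {
    match offset_from_rbp {
        i64::MIN..=-1 => Slot::OwnFrame,
        0..=7 => Slot::SavedRbp,
        8..=15 => Slot::ReturnAddress,
        16..=47 => Slot::ShadowSpace,
        _ => Slot::CallerFrame,
    }
}
```

    In this model, classify(-8) is OwnFrame (clang's spill slot, which is why it needs the extra 16 bytes), while classify(16) is ShadowSpace (GCC's spill slot, which costs no extra stack).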