windows assembly x86-64 calling-convention stack-memory

Is the caller or callee responsible for freeing shadow store in x64 assembly (windows)?

Coming from C and C++, I have recently started to learn x86-64 assembly to better understand how my programs work.

I know that the Windows x64 calling convention requires reserving 32 bytes of 'shadow store' on the stack before calling a function (e.g. with subq $0x20, %rsp).

What I am unsure about is: is the callee responsible for incrementing %rsp again, or the caller?

In other words (using printf as an example), would number 1 or number 2 be correct (or perhaps neither :P)?

1.

subq $0x20, %rsp
movabsq $msg, %rcx
callq printf

2.

subq $0x20, %rsp
movabsq $msg, %rcx
callq printf
addq $0x20, %rsp

(... where msg is an ascii string stored in the .data section that I am passing to printf)

I am on Windows 10, using GAS as my assembler.

Any help would be much appreciated, cheers.

Solution

Deallocating shadow space is the caller's responsibility.

But normally you'd do it once per function, not once per call-site within a function. Usually you just move RSP once (maybe after some pushes) and leave it alone until you're ready to return. That single allocation also includes room for stack args, if any of the functions you call take more than 4 args.
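To make the sizing of that single allocation concrete, here is a rough sketch in C of the arithmetic involved. The helper names frame_size and arg_slot_offset are made up for this illustration and are not part of any toolchain. It assumes a non-leaf function, 8-byte argument slots, and the usual rule that RSP is 16-byte aligned at each call.

```c
#include <stddef.h>

/* Hypothetical helper: bytes to subtract from RSP once in the prologue,
 * after `pushes` 8-byte register pushes, for a function that makes calls.
 *   locals   - bytes of local storage wanted
 *   max_args - largest argument count of any function it calls
 */
static size_t frame_size(size_t pushes, size_t locals, size_t max_args)
{
    /* Shadow space is always 4 slots, even if the callee takes fewer args.
     * Args 5 and up each get one more 8-byte slot right above it. */
    size_t slots = max_args < 4 ? 4 : max_args;
    size_t size = slots * 8 + locals;

    /* On function entry RSP is 8 mod 16 because of the return address.
     * RSP must be 16-byte aligned at each call, so pad the total up. */
    size_t used = 8 + pushes * 8 + size;
    if (used % 16 != 0)
        size += 16 - used % 16;
    return size;
}

/* Hypothetical helper: offset from RSP (at the call instruction) of the
 * slot for the n-th integer/pointer argument, counting n from 1.
 * Slots 1..4 are the shadow space; the 5th argument is stored at 32(%rsp). */
static size_t arg_slot_offset(size_t n)
{
    return (n - 1) * 8;
}
```

Under these assumptions, frame_size(0, 0, 2) gives 40: 32 bytes of shadow space plus 8 bytes of padding to realign the stack after the return address was pushed. That is why compilers typically emit subq $40, %rsp rather than subq $32, %rsp for a simple function that calls printf.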

In the Windows x64 calling convention (and in x86-64 System V), the callee must return without changing the caller's RSP, i.e. with a plain ret, not ret 32, and without having copied the return address somewhere else.

MS has some examples at https://learn.microsoft.com/en-us/cpp/build/prolog-and-epilog?view=msvc-170#epilog-code
MS also specifically documents that RSP mustn't be changed by functions:

The x64 ABI considers registers RBX, RBP, RDI, RSI, RSP, R12, R13, R14, R15, and XMM6-XMM15 nonvolatile. They must be saved and restored by a function that uses them.

(You also need to emit unwind metadata for every instruction that moves the stack pointer, and about where you saved non-volatile aka call-preserved registers, if you want to be fully compliant with the ABI, including for SEH and C++ exception unwinding. Toy programs still work fine without, as long as you don't expect C++ exceptions to work, or debuggers to unwind the stack back to the stack frame of a caller.)

You can see this if you look at MSVC compiler output, e.g. https://godbolt.org/z/xh38jxWqT . For AT&T syntax, you can use gcc -O2 -mabi=ms, which tells GCC to treat every function it sees as __attribute__((ms_abi)) by default. That doesn't change the fact that it's targeting Linux, though: even with -fPIE to make it use RIP-relative LEA instead of 32-bit absolute addressing for symbol addresses, we get call printf@plt, not Windows-style calls to DLL functions.

But the stack management from GCC matches what MSVC -O2 also does.

#include <stdio.h>

void bar();
int foo(){
    printf("%d\n", 1);
    bar();
    return 1;  // make sure this isn't a tailcall
}

# gcc -O2 -mabi=ms  (but still sort of targeting Linux as far as dynamic linking)
.LC0:
        .string "%d\n"      ## in .rodata

foo():
        subq    $40, %rsp
        movl    $1, %edx
        movl    $.LC0, %ecx      # with -fPIE, uses    leaq    .LC0(%rip), %rcx  like you'd want for Windows x64
        call    printf
        call    bar()
        movl    $1, %eax
        addq    $40, %rsp
        ret

See also How to remove "noise" from GCC/clang assembly output? for more about looking at compiler output. You can answer most questions about how things normally work by looking at what compilers do in practice. Sometimes what a compiler does is just a coincidence, especially with optimization disabled, which is why I constructed an example where the functions can't inline, so the calls are still visible with optimization enabled. But here we can rule out your alternate hypothesis (that the callee frees the shadow space).

I also constructed this example to show two calls using the same allocation of shadow space, not pointlessly deallocating / reallocating with add/sub. Even with optimization disabled, compilers don't do that.

Re: putting symbol addresses into registers, see How to load address of function or label into register. RIP-relative LEA is the go-to option: it's position-independent, and works in any executable or library smaller than 2GiB of static code+data. It's also more efficient than movabs.