How exactly does the callstack work?


I'm trying to get a deeper understanding of how the low level operations of programming languages work and especially how they interact with the OS/CPU. I've probably read every answer in every stack/heap related thread here on Stack Overflow, and they are all brilliant. But there is still one thing that I didn't fully understand yet.

Consider this function in pseudo code which tends to be valid Rust code ;-)

fn foo() {
    let a = 1;
    let b = 2;
    let c = 3;
    let d = 4;

    // line X

    doSomething(a, b);
    doAnotherThing(c, d);
}

This is how I assume the stack looks at line X:

Stack

a +-------------+
  | 1           | 
b +-------------+     
  | 2           |  
c +-------------+
  | 3           | 
d +-------------+     
  | 4           | 
  +-------------+ 

Now, everything I've read about how the stack works is that it strictly obeys LIFO rules (last in, first out). Just like a stack datatype in .NET, Java or any other programming language.

But if that's the case, then what happens after line X? Because obviously, the next thing we need is to work with a and b, but that would mean that the OS/CPU (?) has to pop out d and c first to get back to a and b. But then it would shoot itself in the foot, because it needs c and d in the next line.

So, I wonder what exactly happens behind the scenes?

Another, related question: suppose we pass references to one of the other functions, like this:

fn foo() {
    let a = 1;
    let b = 2;
    let c = 3;
    let d = 4;

    // line X

    doSomething(&a, &b);
    doAnotherThing(c, d);
}

As I understand it, this would mean that the parameters in doSomething essentially point to the same memory addresses as a and b in foo. But then again, this means that nothing is popped off the stack until we get down to a and b.

Those two cases make me think that I haven't fully grasped how exactly the stack works and how it strictly follows the LIFO rules.


Solution

  • The call stack could also be called a frame stack.
    The things that are stacked according to the LIFO principle are not the individual local variables but the entire stack frames ("calls") of the functions being called. The local variables are pushed and popped together with those frames, in the so-called function prologue and epilogue, respectively.
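
To make the distinction concrete, here is a minimal C++ sketch that models the call stack as a stack of whole frames and replays the foo() from the question. The Frame and CallStack types, their members, and run_foo are hypothetical names invented for this illustration; a real CPU has no such objects, only a region of memory and a couple of registers.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical model: one Frame per active function call.
struct Frame {
    std::string function;               // name of the function that owns the frame
    std::map<std::string, int> locals;  // locals can be read in any order
};

// Hypothetical model of the call stack: only whole frames are pushed and popped.
struct CallStack {
    std::vector<Frame> frames;

    void call(const std::string& function) { frames.push_back(Frame{function, {}}); }

    void ret() {
        assert(!frames.empty());
        frames.pop_back();
    }

    Frame& top() {
        assert(!frames.empty());
        return frames.back();
    }

    std::size_t depth() const { return frames.size(); }
};

// Replays the question's foo() and returns a + b + c + d, read after both calls.
inline int run_foo(CallStack& stack) {
    stack.call("foo");
    stack.top().locals = {{"a", 1}, {"b", 2}, {"c", 3}, {"d", 4}};

    stack.call("doSomething");     // a new frame is pushed on top of foo's frame
    stack.ret();                   // and popped again when doSomething returns

    stack.call("doAnotherThing");
    stack.ret();

    // foo's frame was never popped, so all four locals are still available.
    Frame& foo_frame = stack.top();
    int sum = foo_frame.locals["a"] + foo_frame.locals["b"] +
              foo_frame.locals["c"] + foo_frame.locals["d"];

    stack.ret();                   // foo returns: its whole frame goes at once
    return sum;
}
```

Calling run_foo on an empty CallStack returns 10 and leaves the stack empty again. Nothing inside foo's frame is ever popped individually; only the frames of doSomething and doAnotherThing come and go on top of it.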

    Inside the frame, the order of the variables is completely unspecified; compilers "reorder" the positions of local variables inside a frame to optimize their alignment so the processor can fetch them as quickly as possible. The crucial fact is that the offset of each variable relative to some fixed address is constant throughout the lifetime of the frame, so it suffices to take an anchor address, say, the address of the frame itself, and work with offsets from that address to the variables. Such an anchor address is held in the so-called base or frame pointer, which is stored in the EBP register (RBP on x86-64). The offsets, on the other hand, are known at compile time and are therefore hardcoded into the machine code.

    This graphic from Wikipedia shows how a typical call stack is structured [1]:

    Picture of a stack

    Add the offset of the variable we want to access to the address contained in the frame pointer and we get the address of that variable. In short, the code accesses local variables directly via constant compile-time offsets from the base pointer; it's simple pointer arithmetic.
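
The arithmetic can be sketched in C++ with a byte array standing in for stack memory. Everything here is an assumption made for illustration: the SimStack type, the frame pointer value 32, and the offsets -4, -8, -12 and -16 for the question's a, b, c and d. A real compiler is free to choose a different layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for stack memory plus the frame-pointer register.
struct SimStack {
    std::vector<std::uint8_t> memory;  // simulated stack memory
    std::size_t frame_pointer;         // index playing the role of EBP/RBP

    // address = frame pointer + offset (offsets of locals are negative)
    std::size_t address_of(std::ptrdiff_t offset) const {
        std::ptrdiff_t address = static_cast<std::ptrdiff_t>(frame_pointer) + offset;
        assert(address >= 0 &&
               static_cast<std::size_t>(address) + sizeof(std::int32_t) <= memory.size());
        return static_cast<std::size_t>(address);
    }

    void store(std::ptrdiff_t offset, std::int32_t value) {
        std::memcpy(&memory[address_of(offset)], &value, sizeof value);
    }

    std::int32_t load(std::ptrdiff_t offset) const {
        std::int32_t value;
        std::memcpy(&value, &memory[address_of(offset)], sizeof value);
        return value;
    }
};

// Assumed compile-time offsets for the locals of the question's foo().
constexpr std::ptrdiff_t OFFSET_A = -4;
constexpr std::ptrdiff_t OFFSET_B = -8;
constexpr std::ptrdiff_t OFFSET_C = -12;
constexpr std::ptrdiff_t OFFSET_D = -16;

// Builds a frame holding a = 1, b = 2, c = 3, d = 4 at the offsets above.
inline SimStack make_foo_frame() {
    SimStack stack{std::vector<std::uint8_t>(64, 0), 32};
    stack.store(OFFSET_A, 1);
    stack.store(OFFSET_B, 2);
    stack.store(OFFSET_C, 3);
    stack.store(OFFSET_D, 4);
    return stack;
}
```

With this frame, load(OFFSET_A) and load(OFFSET_D) can be called in either order, because each access is just an index computation; no value has to be removed to reach another one.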

    Example

    #include <iostream>
    
    int main()
    {
        char c = std::cin.get();
        std::cout << c;
    }
    

    gcc.godbolt.org gives us

    main:
        pushq   %rbp
        movq    %rsp, %rbp
        subq    $16, %rsp
    
        movl    std::cin, %edi
        call    std::basic_istream<char, std::char_traits<char> >::get()
        movb    %al, -1(%rbp)
        movsbl  -1(%rbp), %eax
        movl    %eax, %esi
        movl    std::cout, %edi
        call    [... the insertion operator for char, long thing... ]
    
        movl    $0, %eax
        leave
        ret
    

    for main. I divided the code into three sections. The function prologue consists of the first three instructions:

    • The base pointer is pushed onto the stack.
    • The stack pointer is saved in the base pointer.
    • The stack pointer is decremented to make room for local variables.

    Then the address of cin is moved into the EDI register [2] and get is called; the return value is in EAX.

    So far so good. Now the interesting thing happens:

    The low-order byte of EAX, designated by the 8-bit register AL, is taken and stored in the byte directly below the address held in the base pointer: that is -1(%rbp), an offset of -1 from the base pointer. This byte is our variable c. The offset is negative because the stack grows downwards on x86. The next instruction (movsbl) loads c back into EAX, sign-extending it to 32 bits. EAX is then moved to ESI, cout is moved to EDI, and the insertion operator is called with cout and c as arguments.
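
As a rough illustration of what movb %al and movsbl do to the value, the following C++ sketch mimics the two steps with plain integer arithmetic. The function names are hypothetical and only model the data flow; they are not what the compiler emits.

```cpp
#include <cassert>
#include <cstdint>

// Models reading AL: keep only the low-order byte of a 32-bit EAX value.
inline std::uint8_t low_byte(std::uint32_t eax) {
    return static_cast<std::uint8_t>(eax & 0xFFu);
}

// Models movsbl: widen a byte to 32 bits, copying its sign bit upward.
inline std::int32_t sign_extend_byte(std::uint8_t byte) {
    std::int32_t result = byte >= 0x80 ? static_cast<std::int32_t>(byte) - 256
                                       : static_cast<std::int32_t>(byte);
    assert(result >= -128 && result <= 127);  // always fits a signed byte
    return result;
}

// Models the pair of instructions around -1(%rbp).
inline std::int32_t store_then_reload(std::uint32_t eax) {
    std::uint8_t c = low_byte(eax);  // movb   %al, -1(%rbp)
    return sign_extend_byte(c);      // movsbl -1(%rbp), %eax
}
```

For example, store_then_reload(0x12345641) yields 0x41 (the character 'A' in ASCII), since only the low byte survives the store, while store_then_reload(0xFF) yields -1 because the sign bit of the byte is copied into the upper 24 bits.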

    Finally,

    • The return value of main is stored in EAX: 0. That is because of the implicit return statement. You might also see xorl %eax, %eax instead of movl.
    • leave and ret return to the call site. leave abbreviates the epilogue and implicitly
      • Replaces the stack pointer with the base pointer and
      • Pops the base pointer.

    After leave and ret have been performed, the frame has effectively been popped. Under the 32-bit cdecl calling convention, the caller would still have to clean up any arguments it pushed onto the stack; other conventions, e.g. stdcall, require the callee to tidy up, e.g. by passing the number of bytes to ret. In the x86-64 code above, both arguments were passed in registers (EDI and ESI), so there is nothing left to clean up.
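
The prologue and the leave instruction described above can be modeled on a toy machine. This is a minimal sketch under stated assumptions: the Machine type is hypothetical, addresses are slot indices rather than byte addresses, and each slot stands for 8 bytes, so subq $16, %rsp corresponds to reserving two slots.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical toy machine: a downward-growing stack of 64-bit slots.
struct Machine {
    std::vector<std::uint64_t> stack;  // simulated stack memory, one slot = 8 bytes
    std::uint64_t rsp;                 // stack pointer, as a slot index
    std::uint64_t rbp;                 // base (frame) pointer, as a slot index

    explicit Machine(std::size_t slots) : stack(slots, 0), rsp(slots), rbp(slots) {}

    void push(std::uint64_t value) {
        assert(rsp > 0);
        stack[--rsp] = value;          // the stack grows toward lower indices
    }

    std::uint64_t pop() {
        assert(rsp < stack.size());
        return stack[rsp++];
    }

    // pushq %rbp; movq %rsp, %rbp; subq $N, %rsp
    void prologue(std::uint64_t local_slots) {
        push(rbp);                     // save the caller's base pointer
        rbp = rsp;                     // the new frame starts here
        assert(rsp >= local_slots);
        rsp -= local_slots;            // reserve room for the locals
    }

    // leave: movq %rbp, %rsp; popq %rbp
    void leave() {
        rsp = rbp;                     // discard all locals in one step
        rbp = pop();                   // restore the caller's base pointer
    }
};
```

If a return address is pushed first (as call would do), then prologue(2) followed by leave() restores both registers to the values they had right after the call, and a final pop(), playing the role of ret, hands back the return address. The locals are never popped one by one; moving the stack pointer discards them all at once.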

    Frame Pointer Omission

    It is also possible to use offsets from the stack pointer (ESP) instead of the base/frame pointer. This makes the EBP register, which would otherwise hold the frame pointer, available for arbitrary use, but it can make debugging impossible on some machines, and it will be implicitly turned off for some functions. It is particularly useful when compiling for processors with only a few registers, including x86.

    This optimization is known as FPO (frame pointer omission) and is enabled by -fomit-frame-pointer in GCC and Clang, and by /Oy in MSVC; note that it is implicitly triggered by every optimization level > 0 if and only if debugging is still possible, since it doesn't have any costs apart from that. For further information see here and here.
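
To see why addressing through the stack pointer is possible, and why it takes more bookkeeping, consider this small C++ sketch. The Registers struct, the function to_sp_offset, and the register values used in the note below it are all hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical snapshot of the two registers while a function body runs.
struct Registers {
    std::int64_t frame_pointer;  // value EBP/RBP would hold
    std::int64_t stack_pointer;  // value ESP/RSP holds
};

// Rewrites a frame-pointer-relative offset as a stack-pointer-relative one.
inline std::int64_t to_sp_offset(const Registers& regs, std::int64_t fp_offset) {
    std::int64_t address = regs.frame_pointer + fp_offset;
    assert(address >= regs.stack_pointer);  // locals live at or above the stack pointer
    return address - regs.stack_pointer;
}
```

With a frame pointer of 1000 and a stack pointer of 984, the local at offset -1 from the frame pointer sits at offset 15 from the stack pointer. If the stack pointer then moves down by 8, the same local is at offset 23. The frame-pointer offset stays fixed for the whole function, whereas the stack-pointer offset has to be recomputed by the compiler every time the stack pointer changes.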


    [1] As pointed out in the comments, the frame pointer is presumably meant to point to the address after the return address.

    [2] Note that the registers that start with R are the 64-bit counterparts of the ones that start with E. EAX designates the four low-order bytes of RAX. I used the names of the 32-bit registers for clarity.