Why Are Local Variables of Caller Stack Saved in Registers in Callee Stack?

I'm trying my best to learn about the call stack and how stack frames are structured in an ARM Cortex-M0. It's proving to be a little difficult, but with patience I'm learning. I have several questions throughout this one question, so hopefully you guys can help me out in all areas. The questions I have will be highlighted in bold throughout this explanation.

I'm using an ARM Cortex-M0 with GDB and a simple program to debug. Here is my program:

int main(void) {
    static uint16_t myBits;
    myBits = 0x70;

    halInit();

    return 0;
}

I have a breakpoint set on halInit(). I then execute the command info frame on my GDB terminal to get this output:

Stack level 0, frame at 0x20000400:
pc = 0x80000d8 in main (src/main.c:63); saved pc 0x8002dd2
source language c.
Arglist at 0x200003e8, args: 
Locals at 0x200003e8, Previous frame's sp is 0x20000400
Saved registers:
 r0 at 0x200003e8, r1 at 0x200003ec, r4 at 0x200003f0, r5 at 0x200003f4, r6 at 0x200003f8, lr at 0x200003fc

I will explain how I am interpreting this, please let me know if I am correct.

Stack level 0: Current level of the stack frame. 0 will always represent the top of the stack, in other words the current stack frame being used.

frame at 0x20000400: This represents the location of the stack frame in RAM.

pc = 0x80000d8 in main (src/main.c:63);: This represents the next instruction to be executed, i.e. the program counter value, since the program counter always represents the next instruction to be executed.

saved pc 0x8002dd2: This one is a little confusing to me, but I think it means the return address, essentially the instruction to be executed when it returns from executing the halInit() function. However, if I type the command info reg into my GDB terminal I see that the link register is not this value, but the next address instead: lr 0x8002dd3. Why is that?

source language c.: This represents the language being used.

Arglist at 0x200003e8, args:: This represents the starting address of my arguments that were passed to the stack frame. Since args: is blank, that means no arguments were passed. That makes sense for two reasons: this is the first stack frame in the call stack, and my function doesn't have any arguments: int main(void).

Locals at 0x200003e8: This is the starting address of my local variables. As you can see in my original code snippet, I should have one local variable, myBits. We'll come back to that later.

Previous frame's sp is 0x20000400: This is the stack pointer which points to the top of the caller's stack frame. Since this is the first stack frame, I expect this value should equal the current frame's address, which it does.

Saved registers:
r0 at 0x200003e8
r1 at 0x200003ec
r4 at 0x200003f0
r5 at 0x200003f4
r6 at 0x200003f8
lr at 0x200003fc

These are registers that have been pushed to the stack to be saved for use later by the current stack frame. This part I am curious about because it's the first stack frame so why would it save so many registers? If I execute the command info reg I get the following output:

r0             0x20000428   0x20000428
r1             0x0  0x0
r2             0x0  0x0
r3             0x70 0x70
r4             0x80000c4    0x80000c4
r5             0x20000700   0x20000700
r6             0xffffffff   0xffffffff
r7             0xffffffff   0xffffffff
r8             0xffffffff   0xffffffff
r9             0xffffffff   0xffffffff
r10            0xffffffff   0xffffffff
r11            0xffffffff   0xffffffff
r12            0xffffffff   0xffffffff
sp             0x200003e8   0x200003e8
lr             0x8002dd3    0x8002dd3
pc             0x80000d8    0x80000d8 <main+8>
xPSR           0x21000000   0x21000000

This tells me that if I check the values stored in each of the memory addresses of the saved registers by executing the command p/x *(register), then the values should be equal to that of the values shown in the output above.

Saved registers:
r0 at 0x200003e8 -> 0x20000428
r1 at 0x200003ec -> 0x0
r4 at 0x200003f0 -> 0x80000c4
r5 at 0x200003f4 -> 0x20000700
r6 at 0x200003f8 -> 0xffffffff
lr at 0x200003fc -> 0x8002dd3

It works, the values in each address represent the values shown by the info reg command. However, I notice one thing. I have one local variable myBits with a value of 0x70 and this appears to be stored in r3. However r3 is not pushed to the stack for saving.

If we step into the next instruction, a new stack frame is created for the function halInit(). This is shown by executing the command bt on my terminal. It generates the following output:

#0  halInit () at src/hal/src/hal.c:70
#1  0x080000dc in main () at src/main.c:63

If I execute the command info frame then I get the following output:

Stack level 0, frame at 0x200003e8:
pc = 0x8001842 in halInit (src/hal/src/hal.c:70); saved pc 0x80000dc
called by frame at 0x20000400
source language c.
Arglist at 0x200003e0, args: 
Locals at 0x200003e0, Previous frame's sp is 0x200003e8
Saved registers:
 r3 at 0x200003e0, lr at 0x200003e4

Now we see that register r3 was pushed onto this stack frame. This register holds the value of the variable myBits. Why is r3 pushed onto this stack frame if the caller stack frame is what needs this register?

Sorry for the long post, I just want to cover all areas of required information.

Update

I think I might know why r3 was pushed onto the callee stack and not onto the caller stack even though the caller is the one that needs this value.

Is it because the function halInit() will be modifying the value in r3?

In other words, the callee stack frame knows that the caller stack frame requires this register value, so it will push it onto its own stack frame so that it can modify r3 for its own purpose, then when the stack frame is popped it will restore the value 0x70 that was pushed onto the stack frame back into r3 for the caller to use again. Is this correct and if so, how did the callee stack frame know that the caller stack frame will need this value?

Solution

I'm trying my best to learn about the call stack and how stack frames are structured in an ARM Cortex-M0

So based on that quote: first off, the ARM Cortex-M0 does not have stack frames; processors are really, really dumb logic. The compiler generates stack frames, which are a compiler thing, not an instruction set thing. The notion of a function is a compiler thing, not really anything lower. A compiler uses a calling convention, a basic set of rules designed so that for that language the caller and callee functions know exactly where parameters and return values are, and nobody trashes the other's data.

The compiler authors are free to do whatever they want so long as it works and fits within the rules of the instruction set, as in the logic, not the assembly language. (An assembler author is free to make up whatever assembly language they want, mnemonics and all, so long as the machine code conforms to the rules of the logic.) And they used to do that. The processor vendors have started making recommendations, let's say, and the compilers are conforming to them. It's not about sharing objects across compilers as much as it is 1) I don't have to come up with my own, and 2) we are trusting the IP vendor with their processor and hope that their calling convention was designed for performance and other reasons that we desire.

gcc so far has attempted to conform with ARM's ABI as it evolves and gcc evolves.

When you have "many" registers, what many means is a matter of opinion, but you will see that the convention will use registers first then the stack for passed parameters. You will also see that some registers will be designated as volatile within a function to improve performance over having to use memory (the stack) so much.

By using a debugger and a breakpoint you are looking in the wrong place. Your statement was that you want to understand the call stack and stack frames, which is a compiler thing, not how exceptions are handled in the logic. Unless that is really what you were after, in which case your question wasn't accurate enough to understand.

Compilers like GCC have optimizers, and despite the confusion they create with respect to dead code, learning from the optimized version is easier than learning from the non-optimized version. Let's dive in.

extern unsigned int more_fun ( unsigned int, unsigned int );
unsigned int fun ( unsigned int a, unsigned int b )
{
    return(a+b);
}

Optimized

 <fun>:
   0:   1840        adds    r0, r0, r1
   2:   4770        bx  lr

Not optimized

00000000 <fun>:
   0:   b580        push    {r7, lr}
   2:   b082        sub sp, #8
   4:   af00        add r7, sp, #0
   6:   6078        str r0, [r7, #4]
   8:   6039        str r1, [r7, #0]
   a:   687a        ldr r2, [r7, #4]
   c:   683b        ldr r3, [r7, #0]
   e:   18d3        adds    r3, r2, r3
  10:   0018        movs    r0, r3
  12:   46bd        mov sp, r7
  14:   b002        add sp, #8
  16:   bd80        pop {r7, pc}

First off why is the function at address zero? Because I disassembled the object not a linked binary, maybe I will later. And why disassemble vs compile to assembly? If the disassembler is any good, then you actually get to see what was produced rather than the assembly which will contain, certainly with compiled code, a lot of non-instruction language as well as pseudo code that gets changed when finally assembled.

A stack frame IMO is when there is a second pointer, a frame pointer. You often see this with instruction sets that have instructions or limitations that lean toward this. For example, an instruction set might have a stack pointer register that you can't address from, and a separate frame pointer register that you can. So the typical entry would be to save the frame pointer on the stack, because the caller may have been using it for their frame and we want to return it as found, then copy the stack pointer to the frame pointer, then move the stack pointer as far as needed for this function, so that for interrupts or calls to other functions the stack pointer is on the boundary between used and unused stack space, as it should be at all times. The frame pointer would be used in this case to access any passed in parameters or return addresses in a frame pointer plus offset fashion (for downward growing stacks), and in the negative offset direction for local data.

Now it does look like the compiler is using a frame pointer, what a waste, let's ask it not to:

00000000 <fun>:
   0:   b082        sub sp, #8
   2:   9001        str r0, [sp, #4]
   4:   9100        str r1, [sp, #0]
   6:   9a01        ldr r2, [sp, #4]
   8:   9b00        ldr r3, [sp, #0]
   a:   18d3        adds    r3, r2, r3
   c:   0018        movs    r0, r3
   e:   b002        add sp, #8
  10:   4770        bx  lr

First off, the compiler determined there were 8 bytes of things to save on the stack. Unoptimized, pretty much everything gets a place on the stack, the passed parameters as well as local variables. There weren't any locals in this case, so we just have the passed in ones, two 32 bit numbers, so 8 bytes. The calling convention used attempts to use r0 for the first parameter and r1 for the second if they fit; in this case they do. So the stack frame is formed when 8 is subtracted from the stack pointer, and the stack frame pointer is the stack pointer in this case.

The calling convention used here allows for r0-r3 to be volatile in the function. The compiler does not have to return to the caller with those registers as they were found; they can be used within the function at will. The compiler chose in this case to pull the addition operands from the stack using the next two registers rather than the first two free ones. Once r0 and r1 are saved to the stack, one would assume the "pool" of free registers starts with r0, r1, r2, r3. So yes, it does appear to be broken, but it is what it is: it is functionally correct, and that is the job of a compiler, to produce code that functionally implements the compiled code. The calling convention used by this compiler states that the return value goes in r0 if it fits, which it does.

So the stack frame is setup, 8 is subtracted from sp. Passed in parameters are saved to the stack. Now the function starts by pulling the passed in parameters from the stack, adding them, and placing the result in the return register.
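The sequence just described can be modelled in plain C to make the offsets concrete. This is only a sketch: the `stack` array, its size, and the `model_unoptimized_fun` name are made up for illustration, and a small word array stands in for RAM.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_WORDS 16u

/* Hypothetical model of the unoptimized listing without a frame pointer.
   sp is a byte offset into a small word array standing in for RAM. */
static uint32_t model_unoptimized_fun(uint32_t r0, uint32_t r1)
{
    uint32_t stack[STACK_WORDS] = {0};
    uint32_t sp = STACK_WORDS * 4u;   /* descending stack: start at the top */
    uint32_t r2, r3;

    sp -= 8u;                         /* sub sp, #8        */
    assert(sp % 4u == 0u && sp + 8u <= STACK_WORDS * 4u);
    stack[(sp + 4u) / 4u] = r0;       /* str r0, [sp, #4]  */
    stack[(sp + 0u) / 4u] = r1;       /* str r1, [sp, #0]  */
    r2 = stack[(sp + 4u) / 4u];       /* ldr r2, [sp, #4]  */
    r3 = stack[(sp + 0u) / 4u];       /* ldr r3, [sp, #0]  */
    r3 = r2 + r3;                     /* adds r3, r2, r3   */
    r0 = r3;                          /* movs r0, r3       */
    sp += 8u;                         /* add sp, #8        */
    assert(sp == STACK_WORDS * 4u);   /* sp is back where the caller left it */
    return r0;                        /* bx lr             */
}
```

Calling `model_unoptimized_fun(3, 4)` walks the same store, load, add path as the listing and hands back 7 in the modelled r0.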

Then bx lr is used to return. Look that instruction up along with pop: on armv6-m a pop into pc can be used to return, but on armv4t pop can't be used to switch modes, so compilers pop the return address into a register and then bx to it.

armv4t thumb can't use pop {pc} to return in case this code is mixed with ARM code, so the return pops into a volatile register and does a bx on that register; you can't pop directly into lr in thumb. It is possible that you might be able to tell the compiler that you are not mixing this with ARM code, so it's safe to use pop to return. Depends on the compiler.

00000000 <fun>:
   0:   b580        push    {r7, lr}
   2:   b082        sub sp, #8
   4:   af00        add r7, sp, #0
   6:   6078        str r0, [r7, #4]
   8:   6039        str r1, [r7, #0]
   a:   687a        ldr r2, [r7, #4]
   c:   683b        ldr r3, [r7, #0]
   e:   18d3        adds    r3, r2, r3
  10:   0018        movs    r0, r3
  12:   46bd        mov sp, r7
  14:   b002        add sp, #8
  16:   bc80        pop {r7}
  18:   bc02        pop {r1}
  1a:   4708        bx  r1

Back to the armv6-m build, to see a frame pointer:

00000000 <fun>:
   0:   b580        push    {r7, lr}
   2:   b082        sub sp, #8
   4:   af00        add r7, sp, #0
   6:   6078        str r0, [r7, #4]
   8:   6039        str r1, [r7, #0]
   a:   687a        ldr r2, [r7, #4]
   c:   683b        ldr r3, [r7, #0]
   e:   18d3        adds    r3, r2, r3
  10:   0018        movs    r0, r3
  12:   46bd        mov sp, r7
  14:   b002        add sp, #8
  16:   bd80        pop {r7, pc}

First off you save the frame pointer to the stack as the caller or the caller's caller, etc may be using it, it's a register we have to preserve. Now some calling convention comes into play right off the start. We know that the compiler knows that we are not calling another function so we don't need to preserve the return address (stored in the link register r14), so why push it on the stack why waste the space and the clock cycles? Well, the convention changed not long ago to say the stack should be 64 bit aligned, so you basically push and pop in pairs of registers (an even number of registers). Sometimes they use more than one instruction for a pair as we see in the armv4t return.

So the compiler needed to push another register, it could and you will see sometimes that it does just pick some register it is not using and push that on the stack, maybe we can get that to do this here in a bit. In this case being armv6-m you can switch modes with a pop so it is safe to generate a return using a pop pc, so you save an instruction by using the link register here instead of some other register. A little optimization despite being unoptimized code.
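The register lists in these listings can be read straight out of the machine code. Below is a small sketch that decodes the 16-bit Thumb push encoding (1011 010M followed by an eight-bit register list, where M selects lr) and reports how many bytes each push moves the stack pointer. The function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Thumb PUSH (16-bit): 1011 010M rrrr rrrr
   bits 0-7 select r0-r7, bit 8 (M) selects lr. */
static int is_thumb_push(uint16_t insn)
{
    return (insn & 0xfe00u) == 0xb400u;
}

/* Number of registers named in a PUSH instruction. */
static unsigned push_reg_count(uint16_t insn)
{
    unsigned count = 0;
    unsigned list;

    assert(is_thumb_push(insn));
    list = insn & 0x01ffu;            /* r0-r7 plus the lr bit */
    while (list != 0u) {
        count += list & 1u;
        list >>= 1;
    }
    return count;
}

/* Bytes the instruction moves sp down by (4 per register). */
static unsigned push_bytes(uint16_t insn)
{
    return 4u * push_reg_count(insn);
}
```

Feeding it `0xb580` (push {r7, lr}) gives 8 bytes, a multiple of 8, so the 64 bit alignment holds. The same arithmetic fits the question's GDB output: main's frame runs from 0x200003e8 to 0x20000400, which is 24 bytes for six saved registers.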

Save the frame pointer then associate the frame pointer with the stack pointer, in this case it moves the stack pointer first and makes the frame pointer match the stack pointer then uses the frame pointer for stack accesses. Oh how wasteful, even for unoptimized code. But perhaps this compiler defaults to a frame pointer when told to compile like this.

While here, one of your questions, which I have commented on thus far only indirectly. The full sized ARM processors, armv4t through armv7, support both ARM instructions and thumb instructions. Not every core supports every instruction, there was an evolution, but you can have ARM and thumb instructions coexist as part of the rules defined by the logic for that core. The ARM design to support this is as follows: since ARM instructions have to be word aligned, the lower two bits of the address of an ARM instruction are always zeros. A 16 bit instruction set, also aligned, would always have the lower bit of the address zero. So why not use the lsbit of the address as a way to switch modes? And that is what they chose to do. With a few instructions at first, then more as allowed by the armv7 architecture, if the address of the branch (look up bx first, branch exchange) has an lsbit of 1 then the processor switches to thumb mode when it begins to fetch instructions at that address. The program counter does not retain this one; it is stripped by the instruction, it is just a signal used to tell the instruction to switch modes. If the lsbit is a 0 then the processor switches to ARM mode. If it was already in the said mode it just stays in that mode.

Now comes these cortex-m cores which are thumb only machines, no ARM mode. The tools are in place, it all works no reason to change, if you try to go into ARM mode on a cortex-m you get a fault.

Now look at the code above: sometimes we return with a bx lr and sometimes with a pop pc, and in both cases lr held the "return address". For the bx lr case to have worked, the lsbit of lr must be set. The caller can't know which instruction we are going to use for the return, and the caller doesn't have to, but it likely used a bl to make the call, so the logic actually set the bit, not the compiler. That is why the lr you see is one more than the saved pc: 0x8002dd3 versus 0x8002dd2.
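To make the lsbit business concrete, here is a minimal sketch of how a return address in lr splits into "which mode" and "where to fetch". The helper names are invented for this example; the values in the note below are the ones from the question's GDB output.

```c
#include <assert.h>
#include <stdint.h>

/* Bit 0 of a branch target selects the instruction set:
   1 = thumb, 0 = ARM. It is not part of the address fetched. */
static int targets_thumb(uint32_t lr)
{
    return (int)(lr & 1u);
}

/* Address execution actually resumes at once the bit is stripped. */
static uint32_t resume_address(uint32_t lr)
{
    return lr & ~UINT32_C(1);
}

/* On a thumb-only core such as the Cortex-M0 the bit must be set;
   this model treats a clear bit as a programming error. */
static uint32_t cortex_m_resume_address(uint32_t lr)
{
    assert(targets_thumb(lr));
    return resume_address(lr);
}
```

With lr = 0x8002dd3, `cortex_m_resume_address` gives 0x8002dd2, which is the "saved pc" that GDB printed in info frame.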

If you want to learn about compilers and stack frames, though: while unoptimized code definitely uses the stack, as you can see, the output of a compiler with a decent optimizer can be easier to understand once you learn not to write dead code.

00000000 <fun>:
   0:   1840        adds    r0, r0, r1
   2:   4770        bx  lr

r0 and r1 are the passed in parameters, r0 is where the return value goes, link register is the return address. This is what you would hope a compiler would produce for a function like that.

So now let's try something more complicated.

extern unsigned int more_fun ( unsigned int, unsigned int );
unsigned int fun ( unsigned int a, unsigned int b )
{
    return(more_fun(a,b));
}

00000000 <fun>:
   0:   b510        push    {r4, lr}
   2:   f7ff fffe   bl  0 <more_fun>
   6:   bd10        pop {r4, pc}

A few things. First, why didn't the optimizer do this:

fun:
   b more_fun

I don't know.

Second, why does it say bl 0 when more_fun is not at zero? This is an object, not linked code; once linked, the linker will modify that bl instruction to point at more_fun().

Third, we already got the compiler to push a register we didn't use. It is pushing and popping r4 so that it can keep the stack aligned per the calling convention used by this compiler. It could have chosen almost any one of the registers, and you may find a gcc or llvm/clang version that uses, say, r3 instead of r4. gcc has been using r4 for a bit now. It's the first in the list of registers you have to preserve, and so the first one the compiler reaches for when it wants to keep something across a call (as we will see in a second). So perhaps that's why; who knows, ask the author.

extern unsigned int more_fun ( unsigned int, unsigned int );
unsigned int fun ( unsigned int a, unsigned int b )
{
    more_fun(a,b);
    return(a);
}

00000000 <fun>:
   0:   b510        push    {r4, lr}
   2:   0004        movs    r4, r0
   4:   f7ff fffe   bl  0 <more_fun>
   8:   0020        movs    r0, r4
   a:   bd10        pop {r4, pc}

Now we are making progress. So we tell the compiler it has to save the passed in parameter across a function call. Each function starts the rules over, so each function called can trash r0-r3; if you are using r0-r3 for something you need to save them somewhere. So a very wise choice: instead of saving the passed in parameter on the stack and possibly having to do multiple costly memory cycles to access it, save a caller's (or caller's caller's, etc.) value on the stack and use that register within our function to hold the parameter. As a design it saves a lot of wasted cycles. We needed the stack to be aligned anyway, so this all worked out: preserve r4, and save the return address since we are making a call ourselves which will trash it. Save the parameter we need after the call into r4. Make the call, place the return value in the return register, and return, cleaning up the stack as you go. So the stack frame here is minimal, if it is one at all. Not using the stack much.
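The preserve-and-restore dance can be sketched as a toy register file in C. Everything here is hypothetical: `struct regs`, `model_more_fun`, and the 0xdeadbeef garbage values are stand-ins to show who is allowed to trash what, not anything the compiler emits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register file: only the registers this example touches. */
struct regs {
    uint32_t r0, r1, r2, r3;  /* volatile: any call may overwrite them  */
    uint32_t r4;              /* preserved: must come back unchanged    */
};

/* Stand-in for more_fun(): allowed to trash r0-r3, leaves r4 alone. */
static void model_more_fun(struct regs *cpu)
{
    cpu->r0 = cpu->r0 * 2u + cpu->r1; /* arbitrary result, made up here */
    cpu->r1 = 0xdeadbeefu;
    cpu->r2 = 0xdeadbeefu;
    cpu->r3 = 0xdeadbeefu;
}

/* Model of: push {r4, lr}; movs r4, r0; bl more_fun;
             movs r0, r4;  pop {r4, pc}                */
static void model_fun_keep_a(struct regs *cpu)
{
    uint32_t saved_r4 = cpu->r4;      /* push {r4, lr}: keep the caller's r4 */
    uint32_t a = cpu->r0;

    cpu->r4 = cpu->r0;                /* movs r4, r0: park a across the call */
    model_more_fun(cpu);              /* bl more_fun: r0-r3 are now garbage  */
    assert(cpu->r4 == a);             /* the callee kept its side of the deal */
    cpu->r0 = cpu->r4;                /* movs r0, r4: a is the return value  */
    cpu->r4 = saved_r4;               /* pop {r4, pc}: hand r4 back as found */
}
```

After `model_fun_keep_a` runs, r0 holds the original a and r4 holds whatever the caller had in it, while r1-r3 are garbage. Applied to the question: r3 is one of the volatile registers, so halInit is under no obligation to keep it, and the r3 in its push list is most likely just the spare register chosen to keep the stack aligned.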

extern unsigned int more_fun ( unsigned int, unsigned int );
unsigned int fun ( unsigned int a, unsigned int b )
{
    b<<=more_fun(a,b);
    return(a+b);
}

00000000 <fun>:
   0:   b570        push    {r4, r5, r6, lr}
   2:   0005        movs    r5, r0
   4:   000c        movs    r4, r1
   6:   f7ff fffe   bl  0 <more_fun>
   a:   4084        lsls    r4, r0
   c:   1960        adds    r0, r4, r5
   e:   bd70        pop {r4, r5, r6, pc}

We did it again: we got the compiler to save a register we didn't use to keep the alignment. And we are using more of the stack, but would you call that a stack frame? We forced the compiler to preserve both incoming parameters through a subroutine call.

extern unsigned int more_fun ( unsigned int, unsigned int );
unsigned int fun ( unsigned int a, unsigned int b, unsigned int c, unsigned int d )
{
    b<<=more_fun(b,c);
    c<<=more_fun(c,d);
    d<<=more_fun(b,d);
    return(a+b+c+d);
}


00000000 <fun>:
   0:   b5f8        push    {r3, r4, r5, r6, r7, lr}
   2:   000c        movs    r4, r1
   4:   0007        movs    r7, r0
   6:   0011        movs    r1, r2
   8:   0020        movs    r0, r4
   a:   001d        movs    r5, r3
   c:   0016        movs    r6, r2
   e:   f7ff fffe   bl  0 <more_fun>
  12:   0029        movs    r1, r5
  14:   4084        lsls    r4, r0
  16:   0030        movs    r0, r6
  18:   f7ff fffe   bl  0 <more_fun>
  1c:   0029        movs    r1, r5
  1e:   4086        lsls    r6, r0
  20:   0020        movs    r0, r4
  22:   f7ff fffe   bl  0 <more_fun>
  26:   4085        lsls    r5, r0
  28:   19a4        adds    r4, r4, r6
  2a:   19e4        adds    r4, r4, r7
  2c:   1960        adds    r0, r4, r5
  2e:   bdf8        pop {r3, r4, r5, r6, r7, pc}

What is it going to take? We at least did get it to save r3 to even out the stack. I bet we can push it now...

extern unsigned int more_fun ( unsigned int, unsigned int );
unsigned int fun ( unsigned int a, unsigned int b, unsigned int c, unsigned int d, unsigned int e, unsigned int f )
{
    b<<=more_fun(b,c);
    c<<=more_fun(c,d);
    d<<=more_fun(b,d);
    e<<=more_fun(e,d);
    f<<=more_fun(e,f);
    return(a+b+c+d+e+f);
}

00000000 <fun>:
   0:   b5f0        push    {r4, r5, r6, r7, lr}
   2:   46c6        mov lr, r8
   4:   000c        movs    r4, r1
   6:   b500        push    {lr}
   8:   0011        movs    r1, r2
   a:   0007        movs    r7, r0
   c:   0020        movs    r0, r4
   e:   0016        movs    r6, r2
  10:   001d        movs    r5, r3
  12:   f7ff fffe   bl  0 <more_fun>
  16:   0029        movs    r1, r5
  18:   4084        lsls    r4, r0
  1a:   0030        movs    r0, r6
  1c:   f7ff fffe   bl  0 <more_fun>
  20:   0029        movs    r1, r5
  22:   4086        lsls    r6, r0
  24:   0020        movs    r0, r4
  26:   f7ff fffe   bl  0 <more_fun>
  2a:   4085        lsls    r5, r0
  2c:   9806        ldr r0, [sp, #24]
  2e:   0029        movs    r1, r5
  30:   f7ff fffe   bl  0 <more_fun>
  34:   9b06        ldr r3, [sp, #24]
  36:   9907        ldr r1, [sp, #28]
  38:   4083        lsls    r3, r0
  3a:   0018        movs    r0, r3
  3c:   4698        mov r8, r3
  3e:   f7ff fffe   bl  0 <more_fun>
  42:   9b07        ldr r3, [sp, #28]
  44:   19a4        adds    r4, r4, r6
  46:   4083        lsls    r3, r0
  48:   19e4        adds    r4, r4, r7
  4a:   1964        adds    r4, r4, r5
  4c:   4444        add r4, r8
  4e:   18e0        adds    r0, r4, r3
  50:   bc04        pop {r2}
  52:   4690        mov r8, r2
  54:   bdf0        pop {r4, r5, r6, r7, pc}
  56:   46c0        nop         ; (mov r8, r8)

Okay, that is how it is going to be...

extern unsigned int more_fun ( unsigned int, unsigned int );
extern void not_dead ( unsigned int *);
unsigned int fun ( unsigned int a, unsigned int b )
{
    unsigned int x[16];
    unsigned int ra;
    for(ra=0;ra<16;ra++)
    {
        x[ra]=more_fun(a+ra,b);
    }
    not_dead(x);
    return(ra);
}


00000000 <fun>:
   0:   b5f0        push    {r4, r5, r6, r7, lr}
   2:   0006        movs    r6, r0
   4:   b091        sub sp, #68 ; 0x44
   6:   0004        movs    r4, r0
   8:   000f        movs    r7, r1
   a:   466d        mov r5, sp
   c:   3610        adds    r6, #16
   e:   0020        movs    r0, r4
  10:   0039        movs    r1, r7
  12:   f7ff fffe   bl  0 <more_fun>
  16:   3401        adds    r4, #1
  18:   c501        stmia   r5!, {r0}
  1a:   42b4        cmp r4, r6
  1c:   d1f7        bne.n   e <fun+0xe>
  1e:   4668        mov r0, sp
  20:   f7ff fffe   bl  0 <not_dead>
  24:   2010        movs    r0, #16
  26:   b011        add sp, #68 ; 0x44
  28:   bdf0        pop {r4, r5, r6, r7, pc}
  2a:   46c0        nop         ; (mov r8, r8)

And there is your stack frame, but it doesn't really have a frame pointer and doesn't use one to access stuff. We would have to keep working harder to see that; very doable. But hopefully by now you see my point. Your question is about how stack frames are structured in compiled code, in particular how a compiler might implement that for a particular target.
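The sub sp amount in that listing can be worked out by hand. The sketch below is a guess at the arithmetic that is consistent with the output, not a description of how gcc actually computes it: it assumes sp is 8 byte aligned on entry, every saved register takes 4 bytes, and the whole frame is rounded up to a multiple of 8. The function name is made up.

```c
#include <assert.h>

/* Bytes to subtract from sp for locals so that the whole frame
   (saved registers + locals) stays a multiple of 8. Assumes sp was
   8-byte aligned on entry and every saved register is 4 bytes. */
static unsigned local_area_bytes(unsigned saved_regs, unsigned locals_bytes)
{
    unsigned total;

    assert(locals_bytes % 4u == 0u);
    total = saved_regs * 4u + locals_bytes;
    total = (total + 7u) & ~7u;       /* round up to the next multiple of 8 */
    return total - saved_regs * 4u;
}
```

With five registers pushed and the 64 byte array x, `local_area_bytes(5, 64)` gives 68, which matches the sub sp, #68. With four registers pushed and the same array, `local_area_bytes(4, 64)` gives 64 with no padding needed.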

Incidentally, this is what clang did with that code.

00000000 <fun>:
   0:   b5b0        push    {r4, r5, r7, lr}
   2:   af02        add r7, sp, #8
   4:   b090        sub sp, #64 ; 0x40
   6:   460c        mov r4, r1
   8:   4605        mov r5, r0
   a:   f7ff fffe   bl  0 <more_fun>
   e:   9000        str r0, [sp, #0]
  10:   1c68        adds    r0, r5, #1
  12:   4621        mov r1, r4
  14:   f7ff fffe   bl  0 <more_fun>
  18:   9001        str r0, [sp, #4]
  1a:   1ca8        adds    r0, r5, #2
  1c:   4621        mov r1, r4
  1e:   f7ff fffe   bl  0 <more_fun>
  22:   9002        str r0, [sp, #8]
  24:   1ce8        adds    r0, r5, #3
  26:   4621        mov r1, r4
  28:   f7ff fffe   bl  0 <more_fun>
  2c:   9003        str r0, [sp, #12]
  2e:   1d28        adds    r0, r5, #4
  30:   4621        mov r1, r4
  32:   f7ff fffe   bl  0 <more_fun>
  36:   9004        str r0, [sp, #16]
  38:   1d68        adds    r0, r5, #5
  3a:   4621        mov r1, r4
  3c:   f7ff fffe   bl  0 <more_fun>
  40:   9005        str r0, [sp, #20]
  42:   1da8        adds    r0, r5, #6
  44:   4621        mov r1, r4
  46:   f7ff fffe   bl  0 <more_fun>
  4a:   9006        str r0, [sp, #24]
  4c:   1de8        adds    r0, r5, #7
  4e:   4621        mov r1, r4
  50:   f7ff fffe   bl  0 <more_fun>
  54:   9007        str r0, [sp, #28]
  56:   4628        mov r0, r5
  58:   3008        adds    r0, #8
  5a:   4621        mov r1, r4
  5c:   f7ff fffe   bl  0 <more_fun>
  60:   9008        str r0, [sp, #32]
  62:   4628        mov r0, r5
  64:   3009        adds    r0, #9
  66:   4621        mov r1, r4
  68:   f7ff fffe   bl  0 <more_fun>
  6c:   9009        str r0, [sp, #36]   ; 0x24
  6e:   4628        mov r0, r5
  70:   300a        adds    r0, #10
  72:   4621        mov r1, r4
  74:   f7ff fffe   bl  0 <more_fun>
  78:   900a        str r0, [sp, #40]   ; 0x28
  7a:   4628        mov r0, r5
  7c:   300b        adds    r0, #11
  7e:   4621        mov r1, r4
  80:   f7ff fffe   bl  0 <more_fun>
  84:   900b        str r0, [sp, #44]   ; 0x2c
  86:   4628        mov r0, r5
  88:   300c        adds    r0, #12
  8a:   4621        mov r1, r4
  8c:   f7ff fffe   bl  0 <more_fun>
  90:   900c        str r0, [sp, #48]   ; 0x30
  92:   4628        mov r0, r5
  94:   300d        adds    r0, #13
  96:   4621        mov r1, r4
  98:   f7ff fffe   bl  0 <more_fun>
  9c:   900d        str r0, [sp, #52]   ; 0x34
  9e:   4628        mov r0, r5
  a0:   300e        adds    r0, #14
  a2:   4621        mov r1, r4
  a4:   f7ff fffe   bl  0 <more_fun>
  a8:   900e        str r0, [sp, #56]   ; 0x38
  aa:   350f        adds    r5, #15
  ac:   4628        mov r0, r5
  ae:   4621        mov r1, r4
  b0:   f7ff fffe   bl  0 <more_fun>
  b4:   900f        str r0, [sp, #60]   ; 0x3c
  b6:   4668        mov r0, sp
  b8:   f7ff fffe   bl  0 <not_dead>
  bc:   2010        movs    r0, #16
  be:   b010        add sp, #64 ; 0x40
  c0:   bdb0        pop {r4, r5, r7, pc}

Now, you used the term call stack. The calling convention used by this compiler says to use r0-r3 when possible to pass in the first parameters, then use the stack after that.

unsigned int fun ( unsigned int a, unsigned int b, unsigned int c, unsigned int d, unsigned int e )
{
    return(a+b+c+d+e);
}

00000000 <fun>:
   0:   b510        push    {r4, lr}
   2:   9c02        ldr r4, [sp, #8]
   4:   46a4        mov r12, r4
   6:   4463        add r3, r12
   8:   189b        adds    r3, r3, r2
   a:   185b        adds    r3, r3, r1
   c:   1818        adds    r0, r3, r0
   e:   bd10        pop {r4, pc}

So with more than four parameters, the first four are in r0-r3 and the "call stack", assuming that is what you were referring to, holds the fifth parameter. The thumb instruction set uses bl as its main call instruction, which puts the return address in r14; unlike other instruction sets that might use the stack to store the return address, ARM uses a register. And the popular ARM calling conventions use registers for the first few operands, then use the stack after that.
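Where a given parameter ends up can be sketched as a small helper. This assumes every parameter is a plain 32 bit integer and that the prologue only pushes registers (no extra sub sp); the function and macro names are invented for this example.

```c
#include <assert.h>

#define ARG_IN_REGISTER (-1)

/* For the index-th 32-bit integer argument (0-based), return
   ARG_IN_REGISTER if it arrives in r0-r3, otherwise its byte offset
   from sp after the prologue has pushed `pushed_regs` registers. */
static int stack_arg_offset(int index, int pushed_regs)
{
    assert(index >= 0 && pushed_regs >= 0);
    if (index < 4)
        return ARG_IN_REGISTER;
    return pushed_regs * 4 + (index - 4) * 4;
}
```

For the five parameter function above, push {r4, lr} is two registers, so `stack_arg_offset(4, 2)` gives 8, matching ldr r4, [sp, #8]. For the earlier six parameter function, six registers were pushed in total, so e and f sit at offsets 24 and 28, matching the ldr instructions from [sp, #24] and [sp, #28] in that listing.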

You would want to look at other instruction sets to see more of a call stack:

00000000 <_fun>:
   0:   1d80 0008       mov 10(sp), r0
   4:   6d80 000a       add 12(sp), r0
   8:   6d80 0006       add 6(sp), r0
   c:   6d80 0004       add 4(sp), r0
  10:   6d80 0002       add 2(sp), r0
  14:   0087            rts pc