I'm trying my best to learn about the call stack and how stack frames are structured in an ARM Cortex-M0. It's proving to be a little difficult, but with patience I'm learning. I have several questions throughout this one question, so hopefully you guys can help me out in all areas. The questions I have will be highlighted in bold throughout this explanation.
I'm using an ARM Cortex-M0 with GDB and a simple program to debug. Here is my program:
int main(void) {
static uint16_t myBits;
myBits = 0x70;
halInit();
return 0;
}
I have a breakpoint set on halInit(). I then execute the command info frame on my GDB terminal to get this output:
Stack level 0, frame at 0x20000400:
pc = 0x80000d8 in main (src/main.c:63); saved pc 0x8002dd2
source language c.
Arglist at 0x200003e8, args:
Locals at 0x200003e8, Previous frame's sp is 0x20000400
Saved registers:
r0 at 0x200003e8, r1 at 0x200003ec, r4 at 0x200003f0, r5 at 0x200003f4, r6 at 0x200003f8, lr at 0x200003fc
I will explain how I am interpreting this, please let me know if I am correct.
Stack level 0: Current level of the stack frame. 0 will always represent the top of the stack, in other words the current stack frame being used.
frame at 0x20000400: This represents the location of the stack frame in flash memory.
pc = 0x80000d8 in main (src/main.c:63): This represents the next instruction to be executed, i.e. the program counter value, since the program counter always represents the next instruction to be executed.
saved pc 0x8002dd2: This one is a little confusing to me, but I think it means the return address, essentially the instruction to be executed when it returns from executing the halInit() function. However, if I type the command info reg into my GDB terminal I see that the link register is not this value, but the next address instead: lr 0x8002dd3. Why is that?
source language c.: This represents the language being used.
Arglist at 0x200003e8, args: This represents the starting address of my arguments that were passed to the stack frame. Since args: is blank, that means no arguments were passed, which makes sense for two reasons: this is the first stack frame in the call stack, and my function int main(void) doesn't have any arguments.
Locals at 0x200003e8: This is the starting address of my local variables. As you can see in my original code snippet, I should have one local variable, myBits. We'll come back to that later.
Previous frame's sp is 0x20000400: This is the stack pointer which points to the top of the caller's stack frame. Since this is the first stack frame, I expect this value should equal the current frame's address, which it does.
Saved registers:
r0 at 0x200003e8
r1 at 0x200003ec
r4 at 0x200003f0
r5 at 0x200003f4
r6 at 0x200003f8
lr at 0x200003fc
These are registers that have been pushed to the stack to be saved for use later by the current stack frame. This part I am curious about because it's the first stack frame, so why would it save so many registers? If I execute the command info reg I get the following output:
r0 0x20000428 0x20000428
r1 0x0 0x0
r2 0x0 0x0
r3 0x70 0x70
r4 0x80000c4 0x80000c4
r5 0x20000700 0x20000700
r6 0xffffffff 0xffffffff
r7 0xffffffff 0xffffffff
r8 0xffffffff 0xffffffff
r9 0xffffffff 0xffffffff
r10 0xffffffff 0xffffffff
r11 0xffffffff 0xffffffff
r12 0xffffffff 0xffffffff
sp 0x200003e8 0x200003e8
lr 0x8002dd3 0x8002dd3
pc 0x80000d8 0x80000d8 <main+8>
xPSR 0x21000000 0x21000000
This tells me that if I check the values stored in each of the memory addresses of the saved registers by executing the command p/x *(register), then the values should be equal to the values shown in the output above.
Saved registers:
r0 at 0x200003e8 -> 0x20000428
r1 at 0x200003ec -> 0x0
r4 at 0x200003f0 -> 0x80000c4
r5 at 0x200003f4 -> 0xffffffff
r6 at 0x200003f8 -> 0xffffffff
lr at 0x200003fc -> 0x8002dd3
It works, the values in each address represent the values shown by the info reg command. However, I notice one thing. I have one local variable myBits with a value of 0x70 and this appears to be stored in r3. However, r3 is not pushed to the stack for saving.
If we step into the next instruction, a new stack frame is created for the function halInit(). This is shown by executing the command bt on my terminal. It generates the following output:
#0 halInit () at src/hal/src/hal.c:70
#1 0x080000dc in main () at src/main.c:63
If I execute the command info frame then I get the following output:
Stack level 0, frame at 0x200003e8:
pc = 0x8001842 in halInit (src/hal/src/hal.c:70); saved pc 0x80000dc
called by frame at 0x20000400
source language c.
Arglist at 0x200003e0, args:
Locals at 0x200003e0, Previous frame's sp is 0x200003e8
Saved registers:
r3 at 0x200003e0, lr at 0x200003e4
Now we see that register r3 was pushed onto this stack frame. This register holds the value of the variable myBits. Why is r3 pushed onto this stack frame if the caller stack frame is what needs this register?
Sorry for the long post, I just want to cover all areas of required information.
I think I might know why r3 was pushed onto the callee stack and not onto the caller stack even though the caller is the one that needs this value.
Is it because the function halInit() will be modifying the value in r3?
In other words, the callee stack frame knows that the caller stack frame requires this register value, so it will push it onto its own stack frame so that it can modify r3 for its own purpose, then when the stack frame is popped it will restore the value 0x70 that was pushed onto the stack frame back into r3 for the caller to use again. Is this correct and if so, how did the callee stack frame know that the caller stack frame will need this value?
I'm trying my best to learn about the call stack and how stack frames are structured in an ARM Cortex-M0
So based on that quote: first off, the ARM Cortex-M0 does not have stack frames; processors are really, really dumb logic. The compiler generates stack frames, which are a compiler thing, not an instruction set thing. The notion of a function is a compiler thing, not really anything lower. A compiler uses a calling convention, a basic set of rules designed so that, for that language, the caller and callee functions know exactly where parameters and return values are, and nobody trashes the other's data.
The compiler authors are free to do whatever they want so long as it works and fits within the rules of the instruction set, as in the logic, not the assembly language. (An assembler author is free to make up whatever assembly language they want, mnemonics and all, so long as the machine code conforms to the rules of the logic.) And they used to do that. The processor vendors have since started making recommendations, let's say, and the compilers are conforming to them. It's not about sharing objects across compilers as much as it is that 1) I don't have to come up with my own and 2) we are trusting the IP vendor with their processor and hope that their calling convention was designed for performance and other reasons that we desire.
gcc so far has attempted to conform with ARM's ABI as it evolves and gcc evolves.
When you have "many" registers, what many means is a matter of opinion, but you will see that the convention will use registers first then the stack for passed parameters. You will also see that some registers will be designated as volatile within a function to improve performance over having to use memory (the stack) so much.
By using a debugger and a breakpoint you are looking in the wrong place. Your statement was that you want to understand the call stack and stack frames, which is a compiler thing, not how exceptions are handled in the logic. If that is really what you were after, your question wasn't precise enough to tell.
Compilers like GCC have optimizers, and despite the confusion they create with respect to dead code, learning from the optimized output is easier than learning from the non-optimized output. Let's dive in.
extern unsigned int more_fun ( unsigned int, unsigned int );
unsigned int fun ( unsigned int a, unsigned int b )
{
return(a+b);
}
Optimized:
<fun>:
0: 1840 adds r0, r0, r1
2: 4770 bx lr
Not optimized:
00000000 <fun>:
0: b580 push {r7, lr}
2: b082 sub sp, #8
4: af00 add r7, sp, #0
6: 6078 str r0, [r7, #4]
8: 6039 str r1, [r7, #0]
a: 687a ldr r2, [r7, #4]
c: 683b ldr r3, [r7, #0]
e: 18d3 adds r3, r2, r3
10: 0018 movs r0, r3
12: 46bd mov sp, r7
14: b002 add sp, #8
16: bd80 pop {r7, pc}
First off, why is the function at address zero? Because I disassembled the object, not a linked binary; maybe I will later. And why disassemble rather than compile to assembly? If the disassembler is any good, you get to see what was actually produced, rather than assembly source which, certainly for compiled code, contains a lot of non-instruction directives as well as pseudo-instructions that get changed when finally assembled.
A stack frame, IMO, is when there is a second pointer, a frame pointer. You often see this with instruction sets that have instructions or limitations that lean toward it. For example, an instruction set might have a stack pointer register that you can't address from, and another frame pointer register that you can. The typical entry would be to save the frame pointer on the stack, because the caller may have been using it for its frame and we want to return it as found, then copy the stack pointer to the frame pointer, then move the stack pointer as far as needed for this function, so that for interrupts or calls to other functions the stack pointer sits on the boundary between used and unused stack space, as it should at all times. The frame pointer would then be used to access any passed-in parameters or return addresses in a frame-pointer-plus-offset fashion (for downward-growing stacks), and local data in the negative offset direction.
Now it does look like the compiler is using a frame pointer, what a waste, let's ask it not to:
00000000 <fun>:
0: b082 sub sp, #8
2: 9001 str r0, [sp, #4]
4: 9100 str r1, [sp, #0]
6: 9a01 ldr r2, [sp, #4]
8: 9b00 ldr r3, [sp, #0]
a: 18d3 adds r3, r2, r3
c: 0018 movs r0, r3
e: b002 add sp, #8
10: 4770 bx lr
First off, the compiler determined there were 8 bytes of things to save on the stack. Unoptimized, pretty much everything gets a place on the stack, the passed parameters as well as local variables. There weren't any locals in this case, so we just have the passed-in ones, two 32-bit numbers, so 8 bytes. The calling convention used attempts to use r0 for the first parameter and r1 for the second if they fit, and in this case they do. So the stack frame is formed when 8 is subtracted from the stack pointer; the stack frame pointer is the stack pointer in this case.
The calling convention used here allows r0-r3 to be volatile in the function. The compiler does not have to return to the caller with those registers as they were found; they can be used within the function at will. The compiler chose in this case to pull the addition operands from the stack using the next two registers rather than the first two free ones. Once r0 and r1 are saved to the stack, one would assume the "pool" of free registers starts with r0, r1, r2, r3. So yes, it does appear to be broken, but it is what it is: it is functionally correct, and the job of a compiler is to produce code that functionally implements the source. The calling convention used by this compiler states that the return value goes in r0 if it fits, which it does.
So the stack frame is set up: 8 is subtracted from sp and the passed-in parameters are saved to the stack. Then the function proper starts by pulling the passed-in parameters back from the stack, adding them, and placing the result in the return register.
Then bx lr is used to return. Look that instruction up, along with pop: for armv6-m a pop into pc can also return, but for armv4t pop can't be used to switch modes, so compilers return with a bx instead.
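To make that sequence concrete, here is a minimal sketch in C that models the no-frame-pointer listing one instruction at a time. The sim_cpu type, the 16-word stack size, and the helper names are all made up for illustration; this is a toy model, not how the processor or the compiler works internally.

```c
#include <stdint.h>

/* Hypothetical toy model: a small word array as stack memory and a
 * byte-valued stack pointer that grows downward. */
#define SIM_STACK_WORDS 16u

typedef struct {
    uint32_t mem[SIM_STACK_WORDS]; /* simulated stack memory            */
    uint32_t sp;                   /* byte offset into mem, full = 64   */
} sim_cpu;

/* str value, [sp, #offset] */
static void sim_str(sim_cpu *c, uint32_t value, uint32_t offset)
{
    c->mem[(c->sp + offset) / 4u] = value;
}

/* ldr result, [sp, #offset] */
static uint32_t sim_ldr(const sim_cpu *c, uint32_t offset)
{
    return c->mem[(c->sp + offset) / 4u];
}

/* Mirrors the unoptimized, no-frame-pointer listing for fun(a, b). */
uint32_t sim_fun(sim_cpu *c, uint32_t r0, uint32_t r1)
{
    uint32_t r2, r3;

    c->sp -= 8u;          /* sub  sp, #8        make the 8-byte frame   */
    sim_str(c, r0, 4u);   /* str  r0, [sp, #4]  save parameter a        */
    sim_str(c, r1, 0u);   /* str  r1, [sp, #0]  save parameter b        */
    r2 = sim_ldr(c, 4u);  /* ldr  r2, [sp, #4]                          */
    r3 = sim_ldr(c, 0u);  /* ldr  r3, [sp, #0]                          */
    r3 = r2 + r3;         /* adds r3, r2, r3                            */
    r0 = r3;              /* movs r0, r3        result goes in r0       */
    c->sp += 8u;          /* add  sp, #8        tear the frame down     */
    return r0;            /* bx   lr                                    */
}
```

Starting with sp at the top of the simulated stack (byte offset 64), sim_fun(&c, 3, 4) returns 7 and leaves sp where it started. The frame only exists between the sub and the add.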
On armv4t thumb you can't use pop to return in case this code is mixed with ARM code, and you can't pop directly into lr in thumb, so the return pops into a volatile register and does a bx to that register. You might be able to tell the compiler that you are not mixing this with ARM code, so it's safe to use pop to return; it depends on the compiler.
00000000 <fun>:
0: b580 push {r7, lr}
2: b082 sub sp, #8
4: af00 add r7, sp, #0
6: 6078 str r0, [r7, #4]
8: 6039 str r1, [r7, #0]
a: 687a ldr r2, [r7, #4]
c: 683b ldr r3, [r7, #0]
e: 18d3 adds r3, r2, r3
10: 0018 movs r0, r3
12: 46bd mov sp, r7
14: b002 add sp, #8
16: bc80 pop {r7}
18: bc02 pop {r1}
1a: 4708 bx r1
Back on armv6-m, to see a frame pointer:
00000000 <fun>:
0: b580 push {r7, lr}
2: b082 sub sp, #8
4: af00 add r7, sp, #0
6: 6078 str r0, [r7, #4]
8: 6039 str r1, [r7, #0]
a: 687a ldr r2, [r7, #4]
c: 683b ldr r3, [r7, #0]
e: 18d3 adds r3, r2, r3
10: 0018 movs r0, r3
12: 46bd mov sp, r7
14: b002 add sp, #8
16: bd80 pop {r7, pc}
First off you save the frame pointer to the stack as the caller or the caller's caller, etc may be using it, it's a register we have to preserve. Now some calling convention comes into play right off the start. We know that the compiler knows that we are not calling another function so we don't need to preserve the return address (stored in the link register r14), so why push it on the stack why waste the space and the clock cycles? Well, the convention changed not long ago to say the stack should be 64 bit aligned, so you basically push and pop in pairs of registers (an even number of registers). Sometimes they use more than one instruction for a pair as we see in the armv4t return.
So the compiler needed to push another register, it could and you will see sometimes that it does just pick some register it is not using and push that on the stack, maybe we can get that to do this here in a bit. In this case being armv6-m you can switch modes with a pop so it is safe to generate a return using a pop pc, so you save an instruction by using the link register here instead of some other register. A little optimization despite being unoptimized code.
Save the frame pointer then associate the frame pointer with the stack pointer, in this case it moves the stack pointer first and makes the frame pointer match the stack pointer then uses the frame pointer for stack accesses. Oh how wasteful, even for unoptimized code. But perhaps this compiler defaults to a frame pointer when told to compile like this.
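As a rough picture of what r7 points at in this listing, the frame can be written down as a C struct. The struct and its field names are my own invention for illustration; the offsets are read off the disassembly (push stores r7 below lr, then sub sp, #8 makes room for the two parameters, then r7 is set equal to sp).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of the memory r7 points at after "add r7, sp, #0". */
struct fun_frame {
    uint32_t b;        /* [r7, #0]   second parameter, stored from r1   */
    uint32_t a;        /* [r7, #4]   first parameter, stored from r0    */
    uint32_t saved_r7; /* [r7, #8]   caller's r7, from push {r7, lr}    */
    uint32_t saved_lr; /* [r7, #12]  return address, from push {r7, lr} */
};

/* What the body of the listing does once the frame is populated:
 * ldr r2, [r7, #4]; ldr r3, [r7, #0]; adds r3, r2, r3; movs r0, r3 */
uint32_t fun_from_frame(const struct fun_frame *r7)
{
    return r7->a + r7->b;
}
```

The whole frame is 16 bytes: 8 from the push and 8 from the sub. That is why the epilogue needs both an add sp, #8 and a two-register pop.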
While here, one of your questions, which I have commented on indirectly thus far. The full-sized ARM processors, armv4t through armv7, support both ARM instructions and thumb instructions. Not every core supports every instruction, there was an evolution, but ARM and thumb instructions can coexist as part of the rules defined by the logic for that core. The ARM design to support this goes like so: since ARM instructions have to be word aligned, the lower two bits of the address of an ARM instruction are always zero. A 16-bit instruction set, also aligned, would always have the lower bit of the address zero. So why not use the lsbit of the address as a way to switch modes? That is what they chose to do. It started with a few instructions, and more were allowed to do it by the time of the armv7 architecture: if the address of the branch (look up bx first, branch and exchange) has an lsbit of 1, then the processor switches to thumb mode when it begins to fetch instructions at that address. The program counter does not retain this one; it is stripped by the instruction, it is just a signal used to tell the instruction to switch modes. If the lsbit is a 0, then the processor switches to ARM mode. If it was already in said mode, it just stays in that mode.
Now come these cortex-m cores, which are thumb-only machines, no ARM mode. The tools are in place and it all works, so no reason to change; if you try to go into ARM mode on a cortex-m you get a fault.
Now look at the code above: sometimes we return with a bx lr and sometimes with a pop pc, and in both cases lr held the "return address". For the bx lr case to have worked, the lsbit of lr must be set. The caller can't know which instruction we are going to use for the return, and it doesn't have to: it likely used a bl to make the call, so the logic actually set the bit, not the compiler. That is why your lr (0x8002dd3) is one more than the saved pc that GDB reports (0x8002dd2).
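The arithmetic on that bit is small enough to sketch in C. The function names here are hypothetical, and the only values used are the ones from your info frame and info reg output.

```c
#include <stdint.h>

/* Nonzero when bit 0 of a branch target asks for thumb state. */
int lr_is_thumb(uint32_t lr)
{
    return (int)(lr & 1u);
}

/* The address instructions are actually fetched from: bit 0 is a mode
 * flag, not part of the address, so it is cleared. */
uint32_t lr_to_return_address(uint32_t lr)
{
    return lr & ~UINT32_C(1);
}
```

With lr = 0x8002dd3, lr_is_thumb gives 1 and lr_to_return_address gives 0x8002dd2, which matches the saved pc in your first info frame output.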
If you want to learn about compilers and stack frames, though: while unoptimized code definitely uses the stack, as you can see, optimized output from a compiler with decent optimization can be easier to understand, once you learn not to write dead code.
00000000 <fun>:
0: 1840 adds r0, r0, r1
2: 4770 bx lr
r0 and r1 are the passed in parameters, r0 is where the return value goes, link register is the return address. This is what you would hope a compiler would produce for a function like that.
So now let's try something more complicated.
extern unsigned int more_fun ( unsigned int, unsigned int );
unsigned int fun ( unsigned int a, unsigned int b )
{
return(more_fun(a,b));
}
00000000 <fun>:
0: b510 push {r4, lr}
2: f7ff fffe bl 0 <more_fun>
6: bd10 pop {r4, pc}
A few things. First, why didn't the optimizer do this:
fun:
b more_fun
I don't know.
Second, why does it say bl 0 when more_fun is not at zero? This is an object, not linked code; once linked, the linker will modify that bl instruction to point at more_fun().
Third, we already got the compiler to push a register we didn't use. It is pushing and popping r4 so that it can keep the stack aligned per the calling convention used by this compiler. It could have chosen almost any one of the registers, and you may find a gcc or llvm/clang version that uses, say, r3 instead of r4. gcc has been using r4 for a bit now. It's the first in the list of registers you have to preserve, and so the first one the compiler reaches for when it wants to preserve something across a call (as we will see in a second). So perhaps that's why; who knows, ask the author.
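A hedged sketch of the arithmetic behind that extra register, assuming 4-byte registers, an 8-byte alignment requirement, and a stack that was already aligned on entry. The function names are made up, and a real compiler is free to get its padding some other way (for example by folding it into a sub sp).

```c
/* Number of 32-bit registers to push so that the pushed bytes are a
 * multiple of 8: round the needed count up to the next even number. */
unsigned padded_push_count(unsigned needed_regs)
{
    return (needed_regs + 1u) & ~1u;
}

/* Bytes the stack pointer moves for that push. */
unsigned push_bytes(unsigned needed_regs)
{
    return padded_push_count(needed_regs) * 4u;
}
```

Here the function only needs lr saved, one register, so padded_push_count(1) is 2 and the compiler adds r4 to make push {r4, lr}, 8 bytes.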
extern unsigned int more_fun ( unsigned int, unsigned int );
unsigned int fun ( unsigned int a, unsigned int b )
{
more_fun(a,b);
return(a);
}
00000000 <fun>:
0: b510 push {r4, lr}
2: 0004 movs r4, r0
4: f7ff fffe bl 0 <more_fun>
8: 0020 movs r0, r4
a: bd10 pop {r4, pc}
Now we are making progress. We tell the compiler it has to keep the passed-in parameter alive across a function call. Each function starts the rules over, so each function called can trash r0-r3, and if you are using r0-r3 for something you need to save it somewhere. So here is a very wise choice: instead of saving the passed-in parameter on the stack and possibly doing multiple costly memory cycles to access it, save the caller's (or caller's caller's, etc.) value of r4 on the stack and use r4 within our function to hold that parameter. As a design it saves a lot of wasted cycles. We needed the stack to be aligned anyway, so this all worked out: preserve r4, and save the return address since we are making a call ourselves which will trash it. Copy the parameter we need after the call into r4. Make the call, place the return value in the return register, and return, cleaning up the stack as you go. So the stack frame here is minimal, if it is one at all; the stack is not used much.
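The register rules that drive this choice can be summed up in a few lines of C. This is a simplified sketch of the usual ARM procedure call standard split as I understand it (r0-r3 and r12 are scratch, r4-r11 are preserved by the callee); the enum and function names are hypothetical, and some platforms give r9 a special role that this ignores.

```c
/* Hypothetical classification of the core registers r0-r15. */
enum reg_class {
    REG_CALLER_SAVED, /* a called function may trash it        */
    REG_CALLEE_SAVED, /* a called function must return it as found */
    REG_SPECIAL       /* sp, lr, pc: handled by their own rules */
};

enum reg_class classify_core_reg(unsigned r)
{
    if (r <= 3u || r == 12u) {
        return REG_CALLER_SAVED;   /* r0-r3, r12 */
    }
    if (r >= 4u && r <= 11u) {
        return REG_CALLEE_SAVED;   /* r4-r11 */
    }
    return REG_SPECIAL;            /* r13 (sp), r14 (lr), r15 (pc) */
}
```

This is also why the 0x70 sitting in r3 in your main is nothing halInit has to preserve: classify_core_reg(3) is caller-saved. So when a callee pushes r3 next to lr, as halInit appears to, it is most likely alignment padding rather than a favor to the caller.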
extern unsigned int more_fun ( unsigned int, unsigned int );
unsigned int fun ( unsigned int a, unsigned int b )
{
b<<=more_fun(a,b);
return(a+b);
}
00000000 <fun>:
0: b570 push {r4, r5, r6, lr}
2: 0005 movs r5, r0
4: 000c movs r4, r1
6: f7ff fffe bl 0 <more_fun>
a: 4084 lsls r4, r0
c: 1960 adds r0, r4, r5
e: bd70 pop {r4, r5, r6, pc}
We did it again: we got the compiler to save a register we didn't use in order to keep the alignment. We are using more of the stack, but would you call that a stack frame? We forced the compiler to preserve both incoming parameters through a subroutine call.
extern unsigned int more_fun ( unsigned int, unsigned int );
unsigned int fun ( unsigned int a, unsigned int b, unsigned int c, unsigned int d )
{
b<<=more_fun(b,c);
c<<=more_fun(c,d);
d<<=more_fun(b,d);
return(a+b+c+d);
}
0: b5f8 push {r3, r4, r5, r6, r7, lr}
2: 000c movs r4, r1
4: 0007 movs r7, r0
6: 0011 movs r1, r2
8: 0020 movs r0, r4
a: 001d movs r5, r3
c: 0016 movs r6, r2
e: f7ff fffe bl 0 <more_fun>
12: 0029 movs r1, r5
14: 4084 lsls r4, r0
16: 0030 movs r0, r6
18: f7ff fffe bl 0 <more_fun>
1c: 0029 movs r1, r5
1e: 4086 lsls r6, r0
20: 0020 movs r0, r4
22: f7ff fffe bl 0 <more_fun>
26: 4085 lsls r5, r0
28: 19a4 adds r4, r4, r6
2a: 19e4 adds r4, r4, r7
2c: 1960 adds r0, r4, r5
2e: bdf8 pop {r3, r4, r5, r6, r7, pc}
What is it going to take? We at least did get it to save r3 to even out the stack. I bet we can push it now...
extern unsigned int more_fun ( unsigned int, unsigned int );
unsigned int fun ( unsigned int a, unsigned int b, unsigned int c, unsigned int d, unsigned int e, unsigned int f )
{
b<<=more_fun(b,c);
c<<=more_fun(c,d);
d<<=more_fun(b,d);
e<<=more_fun(e,d);
f<<=more_fun(e,f);
return(a+b+c+d+e+f);
}
00000000 <fun>:
0: b5f0 push {r4, r5, r6, r7, lr}
2: 46c6 mov lr, r8
4: 000c movs r4, r1
6: b500 push {lr}
8: 0011 movs r1, r2
a: 0007 movs r7, r0
c: 0020 movs r0, r4
e: 0016 movs r6, r2
10: 001d movs r5, r3
12: f7ff fffe bl 0 <more_fun>
16: 0029 movs r1, r5
18: 4084 lsls r4, r0
1a: 0030 movs r0, r6
1c: f7ff fffe bl 0 <more_fun>
20: 0029 movs r1, r5
22: 4086 lsls r6, r0
24: 0020 movs r0, r4
26: f7ff fffe bl 0 <more_fun>
2a: 4085 lsls r5, r0
2c: 9806 ldr r0, [sp, #24]
2e: 0029 movs r1, r5
30: f7ff fffe bl 0 <more_fun>
34: 9b06 ldr r3, [sp, #24]
36: 9907 ldr r1, [sp, #28]
38: 4083 lsls r3, r0
3a: 0018 movs r0, r3
3c: 4698 mov r8, r3
3e: f7ff fffe bl 0 <more_fun>
42: 9b07 ldr r3, [sp, #28]
44: 19a4 adds r4, r4, r6
46: 4083 lsls r3, r0
48: 19e4 adds r4, r4, r7
4a: 1964 adds r4, r4, r5
4c: 4444 add r4, r8
4e: 18e0 adds r0, r4, r3
50: bc04 pop {r2}
52: 4690 mov r8, r2
54: bdf0 pop {r4, r5, r6, r7, pc}
56: 46c0 nop ; (mov r8, r8)
Okay, so that is how it is going to be...
extern unsigned int more_fun ( unsigned int, unsigned int );
extern void not_dead ( unsigned int *);
unsigned int fun ( unsigned int a, unsigned int b )
{
unsigned int x[16];
unsigned int ra;
for(ra=0;ra<16;ra++)
{
x[ra]=more_fun(a+ra,b);
}
not_dead(x);
return(ra);
}
00000000 <fun>:
0: b5f0 push {r4, r5, r6, r7, lr}
2: 0006 movs r6, r0
4: b091 sub sp, #68 ; 0x44
6: 0004 movs r4, r0
8: 000f movs r7, r1
a: 466d mov r5, sp
c: 3610 adds r6, #16
e: 0020 movs r0, r4
10: 0039 movs r1, r7
12: f7ff fffe bl 0 <more_fun>
16: 3401 adds r4, #1
18: c501 stmia r5!, {r0}
1a: 42b4 cmp r4, r6
1c: d1f7 bne.n e <fun+0xe>
1e: 4668 mov r0, sp
20: f7ff fffe bl 0 <not_dead>
24: 2010 movs r0, #16
26: b011 add sp, #68 ; 0x44
28: bdf0 pop {r4, r5, r6, r7, pc}
2a: 46c0 nop ; (mov r8, r8)
And there is your stack frame, but it doesn't really have a frame pointer and doesn't use the stack to access stuff. We would have to keep working harder to see that; very doable. But hopefully by now you see my point. Your question is about how stack frames are structured in compiled code, in particular how a compiler might implement that for a particular target.
Incidentally, this is what clang did with that code.
00000000 <fun>:
0: b5b0 push {r4, r5, r7, lr}
2: af02 add r7, sp, #8
4: b090 sub sp, #64 ; 0x40
6: 460c mov r4, r1
8: 4605 mov r5, r0
a: f7ff fffe bl 0 <more_fun>
e: 9000 str r0, [sp, #0]
10: 1c68 adds r0, r5, #1
12: 4621 mov r1, r4
14: f7ff fffe bl 0 <more_fun>
18: 9001 str r0, [sp, #4]
1a: 1ca8 adds r0, r5, #2
1c: 4621 mov r1, r4
1e: f7ff fffe bl 0 <more_fun>
22: 9002 str r0, [sp, #8]
24: 1ce8 adds r0, r5, #3
26: 4621 mov r1, r4
28: f7ff fffe bl 0 <more_fun>
2c: 9003 str r0, [sp, #12]
2e: 1d28 adds r0, r5, #4
30: 4621 mov r1, r4
32: f7ff fffe bl 0 <more_fun>
36: 9004 str r0, [sp, #16]
38: 1d68 adds r0, r5, #5
3a: 4621 mov r1, r4
3c: f7ff fffe bl 0 <more_fun>
40: 9005 str r0, [sp, #20]
42: 1da8 adds r0, r5, #6
44: 4621 mov r1, r4
46: f7ff fffe bl 0 <more_fun>
4a: 9006 str r0, [sp, #24]
4c: 1de8 adds r0, r5, #7
4e: 4621 mov r1, r4
50: f7ff fffe bl 0 <more_fun>
54: 9007 str r0, [sp, #28]
56: 4628 mov r0, r5
58: 3008 adds r0, #8
5a: 4621 mov r1, r4
5c: f7ff fffe bl 0 <more_fun>
60: 9008 str r0, [sp, #32]
62: 4628 mov r0, r5
64: 3009 adds r0, #9
66: 4621 mov r1, r4
68: f7ff fffe bl 0 <more_fun>
6c: 9009 str r0, [sp, #36] ; 0x24
6e: 4628 mov r0, r5
70: 300a adds r0, #10
72: 4621 mov r1, r4
74: f7ff fffe bl 0 <more_fun>
78: 900a str r0, [sp, #40] ; 0x28
7a: 4628 mov r0, r5
7c: 300b adds r0, #11
7e: 4621 mov r1, r4
80: f7ff fffe bl 0 <more_fun>
84: 900b str r0, [sp, #44] ; 0x2c
86: 4628 mov r0, r5
88: 300c adds r0, #12
8a: 4621 mov r1, r4
8c: f7ff fffe bl 0 <more_fun>
90: 900c str r0, [sp, #48] ; 0x30
92: 4628 mov r0, r5
94: 300d adds r0, #13
96: 4621 mov r1, r4
98: f7ff fffe bl 0 <more_fun>
9c: 900d str r0, [sp, #52] ; 0x34
9e: 4628 mov r0, r5
a0: 300e adds r0, #14
a2: 4621 mov r1, r4
a4: f7ff fffe bl 0 <more_fun>
a8: 900e str r0, [sp, #56] ; 0x38
aa: 350f adds r5, #15
ac: 4628 mov r0, r5
ae: 4621 mov r1, r4
b0: f7ff fffe bl 0 <more_fun>
b4: 900f str r0, [sp, #60] ; 0x3c
b6: 4668 mov r0, sp
b8: f7ff fffe bl 0 <not_dead>
bc: 2010 movs r0, #16
be: b010 add sp, #64 ; 0x40
c0: bdb0 pop {r4, r5, r7, pc}
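The 68 in the gcc listing and the 64 in the clang listing both fall out of the same alignment rule. A small sketch, assuming the only things in the frame are the pushed registers and the locals, and that the total has to be a multiple of 8; the function name is hypothetical.

```c
/* Bytes to subtract from sp after the push so that pushed registers plus
 * locals add up to a multiple of 8. */
unsigned sp_adjust_bytes(unsigned pushed_regs, unsigned local_bytes)
{
    unsigned pushed = pushed_regs * 4u;          /* 4 bytes per register  */
    unsigned total  = pushed + local_bytes;      /* unpadded frame size   */
    unsigned padded = (total + 7u) & ~7u;        /* round up to 8         */
    return padded - pushed;                      /* what sub sp must do   */
}
```

gcc pushes five registers (20 bytes) and needs 64 bytes for x[16]; 84 rounds up to 88, so it subtracts 68. clang pushes four registers (16 bytes); 80 is already a multiple of 8, so it subtracts exactly 64.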
Now, you used the term call stack. The calling convention used by this compiler says to use r0-r3 when possible to pass in the first parameters, then use the stack after that.
unsigned int fun ( unsigned int a, unsigned int b, unsigned int c, unsigned int d, unsigned int e )
{
return(a+b+c+d+e);
}
00000000 <fun>:
0: b510 push {r4, lr}
2: 9c02 ldr r4, [sp, #8]
4: 46a4 mov r12, r4
6: 4463 add r3, r12
8: 189b adds r3, r3, r2
a: 185b adds r3, r3, r1
c: 1818 adds r0, r3, r0
e: bd10 pop {r4, pc}
So with more than four parameters, the first four are in r0-r3, and the "call stack", assuming that is what you were referring to, holds the fifth parameter. The thumb instruction set uses bl as its main call instruction, which puts the return address in r14; unlike other instruction sets that might use the stack to store the return address, ARM uses a register. And the popular ARM calling conventions use registers for the first few operands, then use the stack after that.
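Where a given parameter ends up can be sketched as a small C function. This assumes every parameter is a plain 32-bit word (no 64-bit values or structs, which have extra rules), and the type and function names are hypothetical.

```c
/* Hypothetical description of where a word-sized argument lives. */
typedef struct {
    int      in_register; /* 1: argument is in a register; 0: on the stack */
    unsigned reg;         /* register number when in_register is 1         */
    unsigned sp_offset;   /* byte offset from the current sp otherwise     */
} arg_location;

/* index: zero-based parameter number.
 * bytes_pushed_since_entry: how far the function has moved sp down since
 * it was entered (pushes plus any sub sp). */
arg_location locate_word_arg(unsigned index, unsigned bytes_pushed_since_entry)
{
    arg_location loc = { 0, 0u, 0u };

    if (index < 4u) {
        loc.in_register = 1;   /* first four words travel in r0-r3 */
        loc.reg = index;
    } else {
        /* The caller left the remaining words at sp+0, sp+4, ... on entry. */
        loc.sp_offset = (index - 4u) * 4u + bytes_pushed_since_entry;
    }
    return loc;
}
```

In the five-parameter listing the function has pushed {r4, lr}, 8 bytes, so locate_word_arg(4, 8) gives offset 8, matching ldr r4, [sp, #8]. In the earlier six-parameter listing 24 bytes had been pushed, which puts e and f at [sp, #24] and [sp, #28].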
You would want to look at other instruction sets to see more of a call stack, for example one where every parameter is read from the stack:
00000000 <_fun>:
0: 1d80 0008 mov 10(sp), r0
4: 6d80 000a add 12(sp), r0
8: 6d80 0006 add 6(sp), r0
c: 6d80 0004 add 4(sp), r0
10: 6d80 0002 add 2(sp), r0
14: 0087 rts pc