Understanding an AArch64 function call in assembly: how is the stack used?

test.c (bare metal)

#include <stdio.h>

int add1(int a, int b)
{
    int c;
    c = a + b;
    return c;
}

int main()
{
    int x, y, z;
    x = 3;
    y = 4;
    z = add1(x, y);
    printf("z = %d\n", z);
}

I compile with aarch64-none-elf-gcc test.c -specs=rdimon.specs and get a.out. Then I run aarch64-none-elf-objdump -d a.out to get the assembly code. Here are the add1 and main functions.

00000000004002e0 <add1>:
  4002e0:   d10083ff    sub sp, sp, #0x20       <-- reduce sp by 0x20 (just above it are saved fp and lr of main)
  4002e4:   b9000fe0    str w0, [sp, #12]       <-- save first param x at sp + 12
  4002e8:   b9000be1    str w1, [sp, #8]        <-- save second param y at sp + 8
  4002ec:   b9400fe1    ldr w1, [sp, #12]       <-- load w1 with x
  4002f0:   b9400be0    ldr w0, [sp, #8]        <-- load w0 with y
  4002f4:   0b000020    add w0, w1, w0          <-- w0 = w1 + w0
  4002f8:   b9001fe0    str w0, [sp, #28]       <-- store w0 (c) to sp+28
  4002fc:   b9401fe0    ldr w0, [sp, #28]       <-- load w0 with the result (seems redundant)
  400300:   910083ff    add sp, sp, #0x20       <-- increment sp by 0x20
  400304:   d65f03c0    ret
0000000000400308 <main>:
  400308:   a9be7bfd    stp x29, x30, [sp, #-32]!   <-- save x29(fp) and x30(lr) at sp - 0x20
  40030c:   910003fd    mov x29, sp                 <-- set fp to new sp, the base of stack growth(down)
  400310:   52800060    mov w0, #0x3                    // #3
  400314:   b9001fe0    str w0, [sp, #28]           <-- x is assigned in sp + #28
  400318:   52800080    mov w0, #0x4                    // #4
  40031c:   b9001be0    str w0, [sp, #24]           <-- y is assigned in sp + #24
  400320:   b9401be1    ldr w1, [sp, #24]           <-- load func param for y
  400324:   b9401fe0    ldr w0, [sp, #28]           <-- load func param for x
  400328:   97ffffee    bl  4002e0 <add1>           <-- call add1 (args are in w0, w1)
  40032c:   b90017e0    str w0, [sp, #20]           <-- store w0 (result z) to sp+20
  400330:   b94017e1    ldr w1, [sp, #20]           <-- load w1 with the result (why? seems redundant. it's already in w0)
  400334:   d0000060    adrp    x0, 40e000 <__sfp_handle_exceptions+0x28>
  400338:   91028000    add x0, x0, #0xa0  <-- looks like loading param x0 for printf
  40033c:   940000e7    bl  4006d8 <printf>
  400340:   52800000    mov w0, #0x0                    // #0 <-- for main's return value..
  400344:   a8c27bfd    ldp x29, x30, [sp], #32  <-- restore x29 and x30 (looks like the values in x29, x30 were used by the function that called main)
  400348:   d65f03c0    ret
  40034c:   d503201f    nop

I marked my understanding with <--. Could someone look over the code and give me some corrections? Any small comment will be appreciated. (Please start from <main>.)

ADD: Thanks for the comments. I think I forgot to ask my real questions. At the start of main, whatever called main should have put its return address (the instruction after the call to main) in x30. Since main calls another function itself, it will overwrite x30, so it saves x30 on its stack. But why does it store it at sp - #0x20? And why are the variables x, y, z stored at sp + #28, sp + #24, sp + #20? When main calls printf, I guess sp and x29 will be decremented by some amount. Does this amount depend on how much stack area the called function (here printf) uses, or is it constant? And how is the storage location of x29, x30 in main determined? Is it chosen so that those two values sit just above the called function's (printf's) stack area? Sorry for so many questions.

Solution

In laying out the stack for main, the compiler has to satisfy the following constraints:

x29 and x30 need to be saved on the stack. They occupy 8 bytes each.
The local variables x,y,z need stack space, 4 bytes each. (If you were optimizing, you'd see them kept in registers instead, or optimized completely out of existence.) That brings us to a total of 8+8+4+4+4=28 bytes.
The stack pointer sp must always be kept aligned to 16 bytes; this is an architectural and ABI constraint (the OS can choose to relax this requirement but normally doesn't). So we can't just subtract 28 from sp; we must round up to the next multiple of 16, which is 32.
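The arithmetic in the three points above can be sketched in C. This is only an illustration of the rounding step, not what the compiler literally runs; the names round_up_to_stack_align and main_frame_size are made up for this example.

```c
#include <assert.h>
#include <stddef.h>

#define STACK_ALIGN 16u   /* sp must stay a multiple of 16 bytes */

/* Round a byte count up to the next multiple of STACK_ALIGN. */
static size_t round_up_to_stack_align(size_t bytes)
{
    return (bytes + (STACK_ALIGN - 1)) & ~(size_t)(STACK_ALIGN - 1);
}

/* Frame size for main in the question (hypothetical helper). */
static size_t main_frame_size(void)
{
    size_t raw = 2 * 8    /* saved x29 and x30, 8 bytes each */
               + 3 * 4;   /* locals x, y, z, 4 bytes each    */
    size_t frame = round_up_to_stack_align(raw);   /* 28 -> 32 */

    assert(frame % STACK_ALIGN == 0);
    return frame;
}
```

Calling main_frame_size() returns 32. A raw need of 33 bytes would round up to 48 in the same way.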

So that's where the 32 or 0x20 that you mention comes from. Note that it is entirely for stack memory used by main itself. It's not a universal constant; you would see it change if you added or removed enough local variables from main.

It has nothing to do with whatever printf needs. If printf needs stack space for its own local variables, then the code within printf will have to take care of adjusting the stack pointer accordingly. The compiler when compiling main does not know how much space that would be, and does not care.
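To make the "each function looks after its own frame" point concrete, here is a toy simulation in C. I use add1 as the callee because its frame size (0x20) is visible in your disassembly; printf's is not, which is exactly the point. All the sim_ names are invented for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Simulated stack pointer and the lowest value it has reached. */
static uint64_t sim_sp;
static uint64_t sim_lowest;

/* Prologue: the function being entered reserves its own frame. */
static void sim_enter(uint64_t frame_bytes)
{
    sim_sp -= frame_bytes;
    assert(sim_sp % 16 == 0);      /* sp stays 16-byte aligned */
    if (sim_sp < sim_lowest)
        sim_lowest = sim_sp;
}

/* Epilogue: the same function releases exactly what it reserved. */
static void sim_leave(uint64_t frame_bytes)
{
    sim_sp += frame_bytes;
}

static void sim_add1(void)
{
    sim_enter(0x20);   /* sub sp, sp, #0x20 */
    sim_leave(0x20);   /* add sp, sp, #0x20 */
}

static void sim_main(void)
{
    sim_enter(0x20);   /* stp x29, x30, [sp, #-32]! */
    sim_add1();        /* main never looks at add1's frame size */
    sim_leave(0x20);   /* ldp x29, x30, [sp], #32 */
}

/* Run main from a given starting sp; returns the deepest stack use in bytes. */
static uint64_t sim_run(uint64_t initial_sp)
{
    sim_sp = sim_lowest = initial_sp;
    sim_main();
    assert(sim_sp == initial_sp);  /* everything was released */
    return initial_sp - sim_lowest;
}
```

With these frame sizes, sim_run(0x1000) returns 64: 32 bytes for main plus 32 for add1. Changing the number inside sim_add1 changes the total but requires no change to sim_main.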

Now the compiler needs to organize these five objects x29, x30, x, y, z within the 32 bytes of stack space that it will create for itself. The choice of what to put where could be almost completely arbitrary, except for the following point.

The function's prologue needs to both subtract 32 from the stack pointer, and store the registers x29, x30 somewhere within the allocated space. This can all be done in a single instruction with the pre-indexed store-pair instruction stp x29, x30, [sp, #-32]!. It subtracts 32 from sp, then stores x29 and x30 in the 16 bytes starting at the address where sp now points. So in order to use this instruction, we have to accept placing x29 and x30 at the bottom of the allocated space, at offsets [sp+0] and [sp+8] relative to the new value of sp. Putting them anywhere else would require extra instructions and be less efficient.

(In fact, because this is the most convenient way to do it, the ABI requires frame records to be laid out this way whenever they are used at all: x29 and x30 contiguous on the stack, in that order (5.2.3).)
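If the pre-indexed addressing mode is unfamiliar, the following C model may help. It treats a byte array as stack memory and sp as an index into it; struct cpu and both function names are assumptions made up for this sketch, and it models only the effect on sp, x29 and x30.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Minimal simulated register state. */
struct cpu {
    uint64_t sp, x29, x30;
};

/* Model of: stp x29, x30, [sp, #imm]!   (pre-indexed)
 * sp is updated first, then the pair is stored at the NEW sp. */
static void stp_fp_lr_preindex(struct cpu *c, uint8_t *mem, int64_t imm)
{
    c->sp += (uint64_t)imm;
    assert(c->sp % 16 == 0);              /* sp must stay 16-byte aligned */
    memcpy(mem + c->sp,     &c->x29, 8);  /* x29 at [sp+0] */
    memcpy(mem + c->sp + 8, &c->x30, 8);  /* x30 at [sp+8] */
}

/* Model of: ldp x29, x30, [sp], #imm   (post-indexed)
 * the pair is loaded from the CURRENT sp, then sp is updated. */
static void ldp_fp_lr_postindex(struct cpu *c, const uint8_t *mem, int64_t imm)
{
    memcpy(&c->x29, mem + c->sp,     8);
    memcpy(&c->x30, mem + c->sp + 8, 8);
    c->sp += (uint64_t)imm;
    assert(c->sp % 16 == 0);
}
```

Calling stp_fp_lr_preindex with imm = -32 mirrors main's prologue, and ldp_fp_lr_postindex with imm = 32 mirrors its epilogue. The saved pair always lands at the lowest addresses of the newly allocated 32 bytes.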

We still have 16 bytes starting at [sp+16] to play with, in which x,y,z must be placed. The compiler has chosen to put them at addresses [sp+28], [sp+24], [sp+20] respectively. The 4 bytes at [sp+16] remain unused, but remember, we had to waste 4 bytes somewhere in order to achieve the proper stack alignment. The choice of arranging these objects, and which slot to leave unused, was completely arbitrary and any other arrangement would have worked just as well.
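One way to picture the resulting frame is as a C struct whose field offsets match the offsets from sp seen in the disassembly. This is only a mental model (the compiler does not define any such type), and struct main_frame and its field names are invented here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* main's 32-byte frame, lowest address (sp) first. */
struct main_frame {
    uint64_t saved_x29;   /* [sp+0]  caller's frame pointer */
    uint64_t saved_x30;   /* [sp+8]  return address         */
    uint32_t unused;      /* [sp+16] alignment padding      */
    int32_t  z;           /* [sp+20] */
    int32_t  y;           /* [sp+24] */
    int32_t  x;           /* [sp+28] */
};

/* Compile-time checks that the model matches the disassembly. */
static_assert(offsetof(struct main_frame, saved_x30) == 8, "x30 at [sp+8]");
static_assert(offsetof(struct main_frame, z) == 20, "z at [sp+20]");
static_assert(offsetof(struct main_frame, y) == 24, "y at [sp+24]");
static_assert(offsetof(struct main_frame, x) == 28, "x at [sp+28]");
static_assert(sizeof(struct main_frame) == 32, "frame is 0x20 bytes");
```

Reordering z, y, x and unused among themselves would give an equally valid frame; only the position of the x29/x30 pair is tied to the choice of the stp instruction.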