assembly x86-64 disassembly calling-convention

Why are there holes on the stack when passing parameters?

I am not quite familiar with assembly code. Excuse me if this question is naive.

I have a simple C program:

int f1(int a1, int a2, int a3, int a4, int a5, int a6, int a7, int a8, int a9)
{
  int c = 3;
  int d = 4;
  return a1 + a2 + a3 + a4 + a5 + a6 + a7 + a8 + a9 + c + d;
}

int main(int argc, char** argv)
{
  f1(1, 2, 3, 4, 5, 6, 7, 8, 9);
}

I compiled it to elf64-x86-64 and got the following disassembly:

f1():

0000000000000000 <f1>:
   0:   55                      push   %rbp
   1:   48 89 e5                mov    %rsp,%rbp
   4:   89 7d ec                mov    %edi,-0x14(%rbp)      ; 1
   7:   89 75 e8                mov    %esi,-0x18(%rbp)       ; 2
   a:   89 55 e4                mov    %edx,-0x1c(%rbp)      ; 3
   d:   89 4d e0                mov    %ecx,-0x20(%rbp)      ; 4
  10:   44 89 45 dc             mov    %r8d,-0x24(%rbp)  ; 5
  14:   44 89 4d d8             mov    %r9d,-0x28(%rbp)  ; 6
  18:   c7 45 f8 03 00 00 00    movl   $0x3,-0x8(%rbp) ; c = 3
  1f:   c7 45 fc 04 00 00 00    movl   $0x4,-0x4(%rbp) ; d = 4
  26:   8b 45 e8                mov    -0x18(%rbp),%eax     ;2
  29:   8b 55 ec                mov    -0x14(%rbp),%edx    ; 1
  2c:   01 c2                   add    %eax,%edx                
  2e:   8b 45 e4                mov    -0x1c(%rbp),%eax     ;3
  31:   01 c2                   add    %eax,%edx
  33:   8b 45 e0                mov    -0x20(%rbp),%eax     ;4
  36:   01 c2                   add    %eax,%edx
  38:   8b 45 dc                mov    -0x24(%rbp),%eax     ;5
  3b:   01 c2                   add    %eax,%edx
  3d:   8b 45 d8                mov    -0x28(%rbp),%eax    ; 6
  40:   01 c2                   add    %eax,%edx
  42:   8b 45 10                mov    0x10(%rbp),%eax     ;7
  45:   01 c2                   add    %eax,%edx
  47:   8b 45 18                mov    0x18(%rbp),%eax    ; 8
  4a:   01 c2                   add    %eax,%edx
  4c:   8b 45 20                mov    0x20(%rbp),%eax    ; 9
  4f:   01 c2                   add    %eax,%edx
  51:   8b 45 f8                mov    -0x8(%rbp),%eax    ; c =3
  54:   01 c2                   add    %eax,%edx
  56:   8b 45 fc                mov    -0x4(%rbp),%eax    ; d =4
  59:   01 d0                   add    %edx,%eax
  5b:   5d                      pop    %rbp
  5c:   c3                      retq

main():

000000000000005d <main>:
  5d:   55                      push   %rbp
  5e:   48 89 e5                mov    %rsp,%rbp
  61:   48 83 ec 30             sub    $0x30,%rsp
  65:   89 7d fc                mov    %edi,-0x4(%rbp)
  68:   48 89 75 f0             mov    %rsi,-0x10(%rbp)
  6c:   c7 44 24 10 09 00 00    movl   $0x9,0x10(%rsp)
  73:   00 
  74:   c7 44 24 08 08 00 00    movl   $0x8,0x8(%rsp)
  7b:   00 
  7c:   c7 04 24 07 00 00 00    movl   $0x7,(%rsp)
  83:   41 b9 06 00 00 00       mov    $0x6,%r9d
  89:   41 b8 05 00 00 00       mov    $0x5,%r8d
  8f:   b9 04 00 00 00          mov    $0x4,%ecx
  94:   ba 03 00 00 00          mov    $0x3,%edx
  99:   be 02 00 00 00          mov    $0x2,%esi
  9e:   bf 01 00 00 00          mov    $0x1,%edi
  a3:   b8 00 00 00 00          mov    $0x0,%eax
  a8:   e8 00 00 00 00          callq  ad <main+0x50>
  ad:   c9                      leaveq 
  ae:   c3                      retq

It seems there are some holes on the stack when passing parameters from main() to f1(): the 4-byte values 7, 8 and 9 are stored 8 bytes apart, at (%rsp), 0x8(%rsp) and 0x10(%rsp).

My questions are:

Why are these holes needed?
And why do we need the 2 lines of assembly below? If they are meant for context restoring, I don't see any instructions that do the restoring. And the %rsi register is not even used elsewhere, so why still save it on the stack?

  65:   89 7d fc                mov    %edi,-0x4(%rbp)
  68:   48 89 75 f0             mov    %rsi,-0x10(%rbp)

One more question I just came up with: since args 1 ~ 6 have already been passed via registers, why are they moved back to memory at the beginning of f1()?

Solution

Arg passing in the x86-64 System V ABI uses 8-byte "slots" on the stack, for args that don't fit in registers. Any stack arg whose size isn't a multiple of 8 bytes (like your 4-byte ints) will be followed by a hole (padding) before the next stack arg.

This is pretty standard for calling conventions across OSes / architectures. Passing a short in a 32-bit calling convention will use a 4-byte stack slot (or take up a whole 4-byte register, whether or not it's sign-extended to the full register width).
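To make the slot arithmetic concrete, here's a rough sketch in C. It only models the simple case from the question (every arg is integer-class and at most 8 bytes wide), and the names INT_ARG_REGS, arg_register, stack_arg_offset_at_entry and stack_arg_offset_from_rbp are made up for this illustration; they don't come from any ABI header.

```c
#include <assert.h>
#include <stddef.h>

/* Registers that carry the first six integer/pointer args in the
   x86-64 System V calling convention, in order. */
static const char *const INT_ARG_REGS[6] = {
    "rdi", "rsi", "rdx", "rcx", "r8", "r9"
};

enum { SLOT_SIZE = 8, NUM_INT_ARG_REGS = 6 };

/* Hypothetical helper: register name for a 0-based arg index,
   or NULL if that arg is passed on the stack. */
const char *arg_register(size_t index)
{
    return index < NUM_INT_ARG_REGS ? INT_ARG_REGS[index] : NULL;
}

/* Offset of a stack-passed arg from %rsp on entry to the callee.
   The return address sits at 0(%rsp), so the first stack arg is at
   8(%rsp), and each later one is a full 8-byte slot further up. */
size_t stack_arg_offset_at_entry(size_t index)
{
    assert(index >= NUM_INT_ARG_REGS);
    return SLOT_SIZE + (index - NUM_INT_ARG_REGS) * SLOT_SIZE;
}

/* Same arg, relative to %rbp after `push %rbp; mov %rsp,%rbp`:
   the saved %rbp adds one more 8-byte slot below the return address. */
size_t stack_arg_offset_from_rbp(size_t index)
{
    return stack_arg_offset_at_entry(index) + SLOT_SIZE;
}
```

For a7, a8 and a9 (indices 6, 7, 8) this gives 0x10, 0x18 and 0x20 from %rbp, which matches the loads in your f1, and 8, 16 and 24 from %rsp at entry. Each int only fills the low 4 bytes of its slot, so the upper 4 bytes are the holes you noticed.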

Your last 2 questions are really asking the same thing:

You're compiling without optimization, so for consistent debugging every variable including function args needs a memory address where a debugger could modify the value when stopped at a breakpoint. This includes main's argc and argv, as well as the register args to f1.

If you defined main as int main(void) (which is one of two valid signatures for main in hosted C implementations, the other being int main(int argc, char**argv)), there'd be no incoming args for main to spill.

If you compiled with optimization enabled, there'd be none of that crap. See How to remove "noise" from GCC/clang assembly output? for suggestions on how to get compilers to make asm that's nice to look at. e.g. from the Godbolt compiler explorer, compiled with gcc -O3 -fPIC¹, you get:

f1:
    addl    %esi, %edi      # a2, tmp106    # tmp106 = a1 + a2
    movl    8(%rsp), %eax   # a7, tmp110
    addl    %edx, %edi      # a3, tmp107
    addl    %ecx, %edi      # a4, tmp108
    addl    %r8d, %edi      # a5, tmp109
    addl    %r9d, %edi      # a6, tmp110
    addl    %edi, %eax      # tmp110, tmp110
    addl    16(%rsp), %eax  # a8, tmp112
    addl    24(%rsp), %eax  # a9, tmp113
    addl    $7, %eax        #, tmp105       # c+d = constant 7
    ret

(I used AT&T syntax instead of Intel because you used that in your question)

IDK exactly why gcc reserves somewhat more stack space than it actually needs; this sometimes happens even with optimization enabled. e.g. gcc's main looks like this:

# gcc -O3
main:
    subq    $16, %rsp    # useless; the space isn't used and it doesn't change stack alignment.
    movl    $6, %r9d
    movl    $5, %r8d
    movl    $4, %ecx
    pushq   $9
    movl    $3, %edx
    movl    $2, %esi
    movl    $1, %edi
    pushq   $8
    pushq   $7
    call    f1@PLT
    xorl    %eax, %eax    # implicit return 0
    addq    $40, %rsp
    ret
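If you want to check the "doesn't change stack alignment" comment yourself, the arithmetic is simple enough to sketch in C. This assumes the usual x86-64 System V rule that %rsp is 16-byte aligned just before a call, so it is 8 mod 16 on entry to any function; rsp_mod16_before_call is a name invented for this example.

```c
#include <assert.h>

/* %rsp modulo 16 just before a `call`, assuming %rsp % 16 == 8 on entry
   to the function (the caller's `call` pushed an 8-byte return address
   onto a 16-byte aligned stack).
   sub_bytes:  total subtracted from %rsp with `sub` instructions
   num_pushes: number of 8-byte `push` instructions before the call */
unsigned rsp_mod16_before_call(unsigned sub_bytes, unsigned num_pushes)
{
    /* This sketch only models adjustments in whole 8-byte units. */
    assert(sub_bytes % 8u == 0);

    unsigned used = sub_bytes + 8u * num_pushes;  /* bytes %rsp moved down */

    /* Start at 8 (mod 16) and move down by `used`, staying non-negative. */
    return (8u + 16u - used % 16u) % 16u;
}
```

With the sub $16 and three pushes, rsp_mod16_before_call(16, 3) is 0, so the stack is aligned at the call. Without the sub, rsp_mod16_before_call(0, 3) is also 0, which is why those 16 bytes look like pure waste.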

All the extra crap that's going on in your version of the function is a consequence of the anti-optimizations required for consistent debugging, which you get with the default -O0. (Consistent debugging means you can set a variable when stopped at a breakpoint, and even jump to another source line inside the same function, and the program will still run and work as you'd expect in the C abstract machine. So the compiler can't keep anything in registers across statements, or optimize based on anything other than literal constants inside a statement.)

-O0 also means compile fast, and don't try to allocate stack space efficiently.
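For what it's worth, the sub $0x30 in your -O0 main can at least be accounted for from the offsets visible in the disassembly. The sketch below is only that arithmetic, not a description of how gcc actually lays out a frame; round_up and frame_size are hypothetical helpers, and the 16-byte rounding is my assumption based on the ABI's stack alignment requirement at a call.

```c
#include <assert.h>

/* Round n up to the next multiple of align (align must be a power of two). */
unsigned long round_up(unsigned long n, unsigned long align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (n + align - 1) & ~(align - 1);
}

/* Hypothetical model of a frame: bytes for spilled locals plus one
   8-byte slot per outgoing stack arg, rounded up to keep %rsp
   16-byte aligned at the call. */
unsigned long frame_size(unsigned long local_bytes,
                         unsigned long outgoing_arg_slots)
{
    return round_up(local_bytes + 8 * outgoing_arg_slots, 16);
}
```

Your main spills argc and argv into the 16 bytes below %rbp and stores three stack args in 8-byte slots starting at (%rsp), so frame_size(16, 3) rounds 40 up to 48, which is the 0x30 you see. The 8 bytes between the two areas are never touched.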

Footnote 1: -fPIC prevents gcc from optimizing away the call in main.

Without that, even with __attribute__((noinline)) on f1, gcc can see that the function has no side effects and that main ignores the return value, so it can just omit the call instead of inlining it and optimizing it away.

But -fPIC means generate code for a shared library, which (when targeting Linux) means symbol interposition is possible, so the compiler can't assume call f1@plt will actually call this definition of f1, and thus can't optimize based on it having no side effects.

clang apparently assumes that it can still optimize that way even with -fPIC, so I guess clang assumes that conflicting definitions of the same function are not allowed or something? (gcc only makes that kind of assumption if you pass -fno-semantic-interposition.) This would seem to break LD_PRELOAD overrides of library functions for calls from within the library.