I am not very familiar with assembly code. Excuse me if this question is naive.
I have a simple C program:
int f1(int a1, int a2, int a3, int a4, int a5, int a6, int a7, int a8, int a9)
{
int c = 3;
int d = 4;
return a1 + a2 + a3 + a4 + a5 + a6 + a7 + a8 + a9 + c + d;
}
int main(int argc, char** argv)
{
f1(1, 2, 3, 4, 5, 6, 7, 8, 9);
}
I compiled it into an elf64-x86-64 object file and got the disassembly below:
f1():
0000000000000000 <f1>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: 89 7d ec mov %edi,-0x14(%rbp) ; 1
7: 89 75 e8 mov %esi,-0x18(%rbp) ; 2
a: 89 55 e4 mov %edx,-0x1c(%rbp) ; 3
d: 89 4d e0 mov %ecx,-0x20(%rbp) ; 4
10: 44 89 45 dc mov %r8d,-0x24(%rbp) ; 5
14: 44 89 4d d8 mov %r9d,-0x28(%rbp) ; 6
18: c7 45 f8 03 00 00 00 movl $0x3,-0x8(%rbp) ; c = 3
1f: c7 45 fc 04 00 00 00 movl $0x4,-0x4(%rbp) ; d = 4
26: 8b 45 e8 mov -0x18(%rbp),%eax ;2
29: 8b 55 ec mov -0x14(%rbp),%edx ; 1
2c: 01 c2 add %eax,%edx
2e: 8b 45 e4 mov -0x1c(%rbp),%eax ;3
31: 01 c2 add %eax,%edx
33: 8b 45 e0 mov -0x20(%rbp),%eax ;4
36: 01 c2 add %eax,%edx
38: 8b 45 dc mov -0x24(%rbp),%eax ;5
3b: 01 c2 add %eax,%edx
3d: 8b 45 d8 mov -0x28(%rbp),%eax ; 6
40: 01 c2 add %eax,%edx
42: 8b 45 10 mov 0x10(%rbp),%eax ;7
45: 01 c2 add %eax,%edx
47: 8b 45 18 mov 0x18(%rbp),%eax ; 8
4a: 01 c2 add %eax,%edx
4c: 8b 45 20 mov 0x20(%rbp),%eax ; 9
4f: 01 c2 add %eax,%edx
51: 8b 45 f8 mov -0x8(%rbp),%eax ; c =3
54: 01 c2 add %eax,%edx
56: 8b 45 fc mov -0x4(%rbp),%eax ; d =4
59: 01 d0 add %edx,%eax
5b: 5d pop %rbp
5c: c3 retq
main():
000000000000005d <main>:
5d: 55 push %rbp
5e: 48 89 e5 mov %rsp,%rbp
61: 48 83 ec 30 sub $0x30,%rsp
65: 89 7d fc mov %edi,-0x4(%rbp)
68: 48 89 75 f0 mov %rsi,-0x10(%rbp)
6c: c7 44 24 10 09 00 00 movl $0x9,0x10(%rsp)
73: 00
74: c7 44 24 08 08 00 00 movl $0x8,0x8(%rsp)
7b: 00
7c: c7 04 24 07 00 00 00 movl $0x7,(%rsp)
83: 41 b9 06 00 00 00 mov $0x6,%r9d
89: 41 b8 05 00 00 00 mov $0x5,%r8d
8f: b9 04 00 00 00 mov $0x4,%ecx
94: ba 03 00 00 00 mov $0x3,%edx
99: be 02 00 00 00 mov $0x2,%esi
9e: bf 01 00 00 00 mov $0x1,%edi
a3: b8 00 00 00 00 mov $0x0,%eax
a8: e8 00 00 00 00 callq ad <main+0x50>
ad: c9 leaveq
ae: c3 retq
It seems there are some holes on the stack when passing parameters from main() to f1():
My questions are:
Why are these holes needed?
And why do we need the 2 lines of assembly below? If they are meant for context restoring, I don't see any instructions that do that. And the %rsi register is NOT even used elsewhere, so why still save %rsi on the stack?
65: 89 7d fc mov %edi,-0x4(%rbp)
68: 48 89 75 f0 mov %rsi,-0x10(%rbp)
Args 1 ~ 6 have already been passed via registers, so why move them back to memory at the beginning of f1()?

Arg passing in the x86-64 System V ABI uses 8-byte "slots" on the stack for args that don't fit in registers. Anything that isn't a multiple of 8 bytes will have holes (padding) before the next stack arg.
This is pretty standard for calling conventions across OSes / architectures. Passing a short in a 32-bit calling convention will use a 4-byte stack slot (or take up a whole 4-byte register, whether or not it's sign-extended to the full register width).
Your last 2 questions are really asking the same thing:
You're compiling without optimization, so for consistent debugging every variable including function args needs a memory address where a debugger could modify the value when stopped at a breakpoint. This includes main's argc and argv, as well as the register args to f1.
If you defined main as int main(void) (which is one of the two valid signatures for main in hosted C implementations, the other being int main(int argc, char**argv)), there'd be no incoming args for main to spill.
If you compiled with optimization enabled, there'd be none of that crap. See How to remove "noise" from GCC/clang assembly output? for suggestions on how to get compilers to make asm that's nice to look at. For example, from the Godbolt compiler explorer, compiled with gcc -O3 -fPIC (footnote 1), you get:
f1:
addl %esi, %edi # a2, tmp106 # tmp106 = a1 + a2
movl 8(%rsp), %eax # a7, tmp110
addl %edx, %edi # a3, tmp107
addl %ecx, %edi # a4, tmp108
addl %r8d, %edi # a5, tmp109
addl %r9d, %edi # a6, tmp110
addl %edi, %eax # tmp110, tmp110
addl 16(%rsp), %eax # a8, tmp112
addl 24(%rsp), %eax # a9, tmp113
addl $7, %eax #, tmp105 # c+d = constant 7
ret
(I used AT&T syntax instead of Intel because you used that in your question)
IDK exactly why gcc reserves somewhat more stack space than it actually needs; this sometimes happens even with optimization enabled. For example, gcc's main looks like this:
# gcc -O3
main:
subq $16, %rsp # useless; the space isn't used and it doesn't change stack alignment.
movl $6, %r9d
movl $5, %r8d
movl $4, %ecx
pushq $9
movl $3, %edx
movl $2, %esi
movl $1, %edi
pushq $8
pushq $7
call f1@PLT
xorl %eax, %eax # implicit return 0
addq $40, %rsp
ret
All the extra crap that's going on in your version of the function is a consequence of the anti-optimizations required for consistent debugging, which you get with the default -O0. (Consistent debugging means you can set a variable when stopped at a breakpoint, and even jump to another source line inside the same function, and the program will still run and work as you'd expect in the C abstract machine. So the compiler can't keep anything in registers across statements, or optimize based on anything other than literal constants inside a statement.)

-O0 also means compile fast, and don't try to allocate stack space efficiently.
Footnote 1: -fPIC prevents gcc from optimizing away the call in main.

Without that, even with __attribute__((noinline)), it can see that the function has no side effects, so it can just omit the call instead of inlining it and optimizing it away.

But -fPIC means generate code for a shared library, which (when targeting Linux) means symbol interposition is possible, so the compiler can't assume call f1@plt will actually call this definition of f1, and thus can't optimize based on it having no side effects.

clang apparently assumes that it can still optimize that way even with -fPIC, so I guess clang assumes that conflicting definitions of the same function are not allowed or something? This would seem to break LD_PRELOAD overrides of library functions for calls from within the library.