I ran sudo perf record -F 99 find / followed by sudo perf report, selected "Annotate fdopendir", and here are the first seven instructions:
push %rbp
push %rbx
mov %edi,%esi
mov %edi,%ebx
mov $0x1,%edi
sub $0xa8,%rsp
mov %rsp,%rbp
The first instruction appears to be saving the caller's frame (base) pointer. I believe instructions 2 through 5 are irrelevant to this question, but they are here for completeness. Instructions 6 and 7 are confusing to me. Shouldn't the copy of rsp into rbp occur before subtracting 0xa8 from rsp?
The x86-64 System V ABI doesn't require making a traditional / legacy stack frame. This looks close to a traditional stack-frame setup, but it definitely isn't, because there's no mov %rsp, %rbp right after the first push %rbp.
We're seeing compiler-generated code that simply uses RBP as a temporary register, here to hold a pointer to a local on the stack. It's just a coincidence that this happens to involve the instruction mov %rsp, %rbp sometime after push %rbp. This is not making a stack frame.
In x86-64 System V, RBX and RBP are the only 2 "low 8" registers (apart from RSP itself) that are call-preserved, and thus usable without REX prefixes in some cases (e.g. for the push/pop, and when used in addressing modes), saving code size. See "What registers are preserved through a linux x86-64 function call". GCC prefers to use them before saving/restoring any of R12..R15. (For pointers, copying them with mov always requires a REX prefix for 64-bit operand-size, so there are fewer savings than for 32-bit integers, but gcc still goes for RBX then RBP, in that order, when it needs to save/restore call-preserved regs in a function.)
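To make the code-size point concrete, here is a minimal sketch of how these two instruction forms are encoded. The helper names encode_push and encode_mov_rr are made up for illustration; they cover only register operands and are not a general assembler. Register numbers 8..15 need a REX prefix byte, and 64-bit operand-size needs REX.W even for the low 8 registers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Register numbers as they appear in x86-64 machine code. */
enum { RAX = 0, RCX = 1, RDX = 2, RBX = 3, RSP = 4, RBP = 5,
       RSI = 6, RDI = 7, R12 = 12 };

/* Hypothetical helper: encode `push %reg` (64-bit register).
   Returns the instruction length in bytes. */
size_t encode_push(unsigned reg, uint8_t *out) {
    size_t n = 0;
    assert(reg < 16);
    if (reg >= 8)
        out[n++] = 0x41;                      /* REX.B selects R8..R15 */
    out[n++] = (uint8_t)(0x50 + (reg & 7));   /* opcode 50+rd */
    return n;
}

/* Hypothetical helper: encode AT&T `mov %src,%dst` between registers
   using opcode 0x89. is64 selects 64-bit operand-size (REX.W). */
size_t encode_mov_rr(unsigned src, unsigned dst, int is64, uint8_t *out) {
    size_t n = 0;
    uint8_t rex = 0x40;
    assert(src < 16 && dst < 16);
    if (is64)     rex |= 0x08;                /* REX.W: 64-bit operand-size */
    if (src >= 8) rex |= 0x04;                /* REX.R extends ModRM.reg */
    if (dst >= 8) rex |= 0x01;                /* REX.B extends ModRM.rm */
    if (rex != 0x40)
        out[n++] = rex;                       /* prefix only when needed */
    out[n++] = 0x89;
    out[n++] = (uint8_t)(0xC0 | ((src & 7) << 3) | (dst & 7));
    return n;
}
```

With these, push %rbp comes out as the single byte 55 and push %r12 as 41 54, while mov %edi,%ebx is 89 fb (2 bytes) and mov %rsp,%rbp is 48 89 e5 (3 bytes).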
Disassembly of /lib/libc.so.6 (glibc) on my system (Arch Linux) shows similar but different code-gen for fdopendir. You stopped the disassembly too soon, before it makes a function call. That sheds some light on why it wanted a call-preserved temporary register: it wanted the var in a reg across the call.
00000000000c1260 <fdopendir>:
c1260: 55 push %rbp
c1261: 89 fe mov %edi,%esi
c1263: 53 push %rbx
c1264: 89 fb mov %edi,%ebx
c1266: bf 01 00 00 00 mov $0x1,%edi
c126b: 48 81 ec a8 00 00 00 sub $0xa8,%rsp
c1272: 64 48 8b 04 25 28 00 00 00 mov %fs:0x28,%rax # stack-check cookie
c127b: 48 89 84 24 98 00 00 00 mov %rax,0x98(%rsp)
c1283: 31 c0 xor %eax,%eax
c1285: 48 89 e5 mov %rsp,%rbp # save a pointer
c1288: 48 89 ea mov %rbp,%rdx # and pass it as a function arg
c128b: e8 90 7d 02 00 callq e9020 <__fxstat>
c1290: 85 c0 test %eax,%eax
c1292: 78 6a js c12fe <fdopendir+0x9e>
c1294: 8b 44 24 18 mov 0x18(%rsp),%eax
c1298: 25 00 f0 00 00 and $0xf000,%eax
c129d: 3d 00 40 00 00 cmp $0x4000,%eax
c12a2: 75 4c jne c12f0 <fdopendir+0x90>
....
c12c1: 48 89 e9 mov %rbp,%rcx # pass the pointer as the 4th arg
c12c4: 89 c2 mov %eax,%edx
c12c6: 31 f6 xor %esi,%esi
c12c8: 89 df mov %ebx,%edi
c12ca: e8 d1 f7 ff ff callq c0aa0 <__alloc_dir>
c12cf: 48 8b 8c 24 98 00 00 00 mov 0x98(%rsp),%rcx
c12d7: 64 48 33 0c 25 28 00 00 00 xor %fs:0x28,%rcx # check the stack cookie
c12e0: 75 38 jne c131a <fdopendir+0xba>
c12e2: 48 81 c4 a8 00 00 00 add $0xa8,%rsp
c12e9: 5b pop %rbx
c12ea: 5d pop %rbp
c12eb: c3 retq
This is pretty silly code-gen; gcc could have simply used mov %rsp, %rcx the 2nd time it needed it. I'd call this a missed optimization. It never needed that pointer in a call-preserved register because it always knew where it was relative to RSP.
(Even if it hadn't been exactly at RSP+0, lea something(%rsp), %rdx and lea something(%rsp), %rcx would have been totally fine the two times it was needed, with probably less total cost than saving/restoring RBP + the required mov instructions.)
Or, since it had the pointer in RBP anyway, it could have used mov 0x18(%rbp),%eax instead of mov 0x18(%rsp),%eax to save a byte of code size in that addressing mode (RSP as a base register needs a SIB byte). Avoiding direct references to RSP between function calls also reduces the number of stack-sync uops Intel CPUs need to insert.
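For anyone who wants to experiment with this, here is a minimal C sketch of the shape of code discussed above: a stack local whose address is passed to one call, inspected, then passed to a second call. It is not glibc's source. fake_stat, fake_fstat, fake_alloc_dir and sketch_fdopendir are hypothetical stand-ins, and the 0xf000 / 0x4000 constants are just the mask and compare values visible in the disassembly.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct stat; only the mode field matters. */
struct fake_stat {
    unsigned mode;
    char padding[140];
};

/* Hypothetical stand-in for the first call (__fxstat in the listing).
   Assumption for the demo: even fds are directories, odd fds are not. */
int fake_fstat(int fd, struct fake_stat *st) {
    assert(st != NULL);
    if (fd < 0)
        return -1;
    st->mode = (fd % 2 == 0) ? 0x4000u : 0x8000u;
    return 0;
}

/* Hypothetical stand-in for the second call (__alloc_dir in the listing),
   which receives the pointer to the local as its 4th argument. */
int fake_alloc_dir(int fd, int close_fd, int flags, const struct fake_stat *st) {
    (void)close_fd;
    (void)flags;
    return ((st->mode & 0xf000u) == 0x4000u) ? fd : -1;
}

int sketch_fdopendir(int fd) {
    struct fake_stat st;                     /* local living on the stack */
    if (fake_fstat(fd, &st) < 0)             /* first use of &st */
        return -1;
    if ((st.mode & 0xf000u) != 0x4000u)      /* same mask/compare as the listing */
        return -2;
    return fake_alloc_dir(fd, 0, 0, &st);    /* second use of &st, after a call */
}
```

To look at the code-gen, something like gcc -O2 -c followed by objdump -d should work. Whether a given GCC version copies RSP into RBP or simply recomputes the address from RSP will vary, and with everything in one file the compiler may inline the helpers, so putting them in a separate translation unit is probably needed to see real calls.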