Consider the following:
ammarfaizi2@integral:/tmp$ vi test.c
ammarfaizi2@integral:/tmp$ cat test.c
extern void use_buffer(void *buf);
void a_func(void)
{
char buffer[4096];
use_buffer(buffer);
}
__asm__("emit_mov_rbp_to_rsp:\n\tmovq %rbp, %rsp");
ammarfaizi2@integral:/tmp$ clang -Wall -Wextra -c -O3 -fno-omit-frame-pointer test.c -o test.o
ammarfaizi2@integral:/tmp$ objdump -d test.o
test.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <emit_mov_rbp_to_rsp>:
0: 48 89 ec mov %rbp,%rsp
3: 66 2e 0f 1f 84 00 00 cs nopw 0x0(%rax,%rax,1)
a: 00 00 00
d: 0f 1f 00 nopl (%rax)
0000000000000010 <a_func>:
10: 55 push %rbp
11: 48 89 e5 mov %rsp,%rbp
14: 48 81 ec 00 10 00 00 sub $0x1000,%rsp
1b: 48 8d bd 00 f0 ff ff lea -0x1000(%rbp),%rdi
22: e8 00 00 00 00 call 27 <a_func+0x17>
27: 48 81 c4 00 10 00 00 add $0x1000,%rsp
2e: 5d pop %rbp
2f: c3 ret
ammarfaizi2@integral:/tmp$
At the end of a_func(), before the return, the function epilogue restores %rsp. It uses add $0x1000, %rsp, which encodes to 7 bytes: 48 81 c4 00 10 00 00.

Can't it just use mov %rbp, %rsp, which encodes to only 3 bytes: 48 89 ec?

Why doesn't clang use the shorter mov %rbp, %rsp? Given the code-size trade-off, what is the advantage of add $0x1000, %rsp over mov %rbp, %rsp?
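For reference, the two epilogue variants can be compared side by side in a standalone assembly file (a minimal sketch; the label name is mine, and the byte counts match the objdump output above):

    .globl epilogue_variants
    epilogue_variants:
        add $0x1000, %rsp   # 48 81 c4 00 10 00 00: 7 bytes (0x1000 needs an imm32)
        mov %rbp, %rsp      # 48 89 ec: 3 bytes

Assembling this with as and disassembling with objdump -d shows the same encodings as in test.o.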
Even with -Os, it still results in the same code, so I think there must be a rational reason to avoid mov %rbp, %rsp.
ammarfaizi2@integral:/tmp$ clang -Wall -Wextra -c -Os -fno-omit-frame-pointer test.c -o test.o
ammarfaizi2@integral:/tmp$ objdump -d test.o
test.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <emit_mov_rbp_to_rsp>:
0: 48 89 ec mov %rbp,%rsp
0000000000000003 <a_func>:
3: 55 push %rbp
4: 48 89 e5 mov %rsp,%rbp
7: 48 81 ec 00 10 00 00 sub $0x1000,%rsp
e: 48 8d bd 00 f0 ff ff lea -0x1000(%rbp),%rdi
15: e8 00 00 00 00 call 1a <a_func+0x17>
1a: 48 81 c4 00 10 00 00 add $0x1000,%rsp
21: 5d pop %rbp
22: c3 ret
ammarfaizi2@integral:/tmp$
If it's using RBP as a frame pointer at all, then yes, mov %rbp, %rsp would be more compact and AFAIK at least as fast on all x86 microarchitectures (mov-elimination probably even works on it). Even more so when the add constant doesn't fit in an imm8.
This is probably a missed optimization, very similar to https://bugs.llvm.org/show_bug.cgi?id=10319 (which proposes using leave instead of mov/pop; that would cost 1 extra uop on Intel but save another 3 bytes). That report points out that the overall static code-size savings are pretty small in normal cases, but it isn't considering efficiency benefits. In normal builds (-O2 without -fno-omit-frame-pointer), only a few functions use a frame pointer at all (only when using VLA / alloca, or when over-aligning the stack), so the possible benefit is even smaller.
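For example (a minimal sketch of mine, not from the bug report), a VLA is enough to make a plain -O2 build keep a frame pointer, and with a dynamic frame size the epilogue has to recover RSP from RBP anyway:

extern void use_buffer(void *buf);
void vla_func(unsigned n)
{
    char buffer[n];     /* VLA: frame size isn't a compile-time constant, */
    use_buffer(buffer); /* so the compiler keeps RBP as a frame pointer   */
}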
It seems from that bug that this is just a peephole that LLVM doesn't bother to look for, because many functions also need to restore other registers, so you actually need to add some other value to point RSP just below the other pushes (see the sketch below).
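A hypothetical frame with one extra saved register, to illustrate (my sketch, not compiler output):

    push %rbp
    mov  %rsp, %rbp
    push %rbx              # call-preserved reg saved below the saved RBP
    sub  $0x1000, %rsp
    ...
    add  $0x1000, %rsp     # RSP must land on the saved-RBX slot
    pop  %rbx              # mov %rbp,%rsp here would make this pop reload the
    pop  %rbp              # saved RBP; you'd need lea -8(%rbp),%rsp instead
    ret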
(GCC sometimes uses mov to restore call-preserved regs so that it can use leave. With a frame pointer, that makes the addressing mode fairly compact to encode, although a 4-byte qword mov -8(%rbp), %r12 is still not as small as a 2-byte pop, of course. And without a frame pointer (e.g. in normal -O2 code), mov %rbp, %rsp was never an option.)
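A sketch of that GCC-style epilogue (the byte counts are my annotations; hypothetical code, not actual GCC output):

    mov  -16(%rbp), %r12   # 4C 8B 65 F0: 4 bytes (vs. 2-byte pop %r12)
    mov  -8(%rbp), %rbx    # 48 8B 5D F8: 4 bytes (vs. 1-byte pop %rbx)
    leave                  # C9: 1 byte, does mov %rbp,%rsp then pop %rbp
    ret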
Before considering the "not worth looking for" reason, I thought of another minor benefit: after calling a function that saves/restores RBP, RBP is a load result. So after mov %rbp, %rsp, future use of RSP would have to wait for that load. Possibly some corner cases end up bottlenecked on store-forwarding latency, vs. register modification just being 1 cycle.

But that seems unlikely to be worth the extra code size in general; I expect such corner cases are rare. Although the new RSP value is needed for the pop %rbp, so the caller's restored RBP value is the result of a chain of two loads after we return. (Fortunately, ret has branch prediction to hide latency.)
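To spell out that dependency chain (the annotations are my reading of the microarchitecture, not measurements):

    call callee         # callee's epilogue ends with pop %rbp: RBP is a load result
    mov  %rbp, %rsp     # RSP now depends on that load (store-forwarding latency)
    pop  %rbp           # a load whose address depends on the previous load
    ret

With add $0x1000, %rsp instead, RSP is just a 1-cycle ALU dependency on its old value and never has to wait on a load.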
So it might be worth trying both ways in some benchmarks, e.g. comparing stock LLVM against a version tweaked to emit mov %rbp, %rsp, on some standard benchmarks like SPECint.