Tags: assembly, clang, x86-64, micro-optimization

Why does clang's epilogue use `add $N, %rsp` instead of `mov %rbp, %rsp` to restore `%rsp`?


Consider the following:

ammarfaizi2@integral:/tmp$ vi test.c
ammarfaizi2@integral:/tmp$ cat test.c

extern void use_buffer(void *buf);

void a_func(void)
{
    char buffer[4096];
    use_buffer(buffer);
}

__asm__("emit_mov_rbp_to_rsp:\n\tmovq %rbp, %rsp");

ammarfaizi2@integral:/tmp$ clang -Wall -Wextra -c -O3 -fno-omit-frame-pointer test.c -o test.o
ammarfaizi2@integral:/tmp$ objdump -d test.o

test.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <emit_mov_rbp_to_rsp>:
   0: 48 89 ec              mov    %rbp,%rsp
   3: 66 2e 0f 1f 84 00 00  cs nopw 0x0(%rax,%rax,1)
   a: 00 00 00 
   d: 0f 1f 00              nopl   (%rax)

0000000000000010 <a_func>:
  10: 55                    push   %rbp
  11: 48 89 e5              mov    %rsp,%rbp
  14: 48 81 ec 00 10 00 00  sub    $0x1000,%rsp
  1b: 48 8d bd 00 f0 ff ff  lea    -0x1000(%rbp),%rdi
  22: e8 00 00 00 00        call   27 <a_func+0x17>
  27: 48 81 c4 00 10 00 00  add    $0x1000,%rsp
  2e: 5d                    pop    %rbp
  2f: c3                    ret    
ammarfaizi2@integral:/tmp$ 

At the end of a_func(), just before the return, the function epilogue restores %rsp. It uses add $0x1000, %rsp, which encodes as 7 bytes: 48 81 c4 00 10 00 00.

Couldn't it just use mov %rbp, %rsp, which encodes as only 3 bytes: 48 89 ec?

Why doesn't clang use the shorter form (mov %rbp, %rsp)?

Given the code-size trade-off, what is the advantage of using add $0x1000, %rsp over mov %rbp, %rsp?
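
For reference, the two epilogues side by side (encodings taken from the objdump output above; the mov variant is hypothetical, it is not what clang emits here):

    # what clang emits (8 bytes before ret):
    add    $0x1000, %rsp    # 48 81 c4 00 10 00 00   (7 bytes)
    pop    %rbp             # 5d
    ret                     # c3

    # hypothetical shorter epilogue (4 bytes before ret):
    mov    %rbp, %rsp       # 48 89 ec               (3 bytes)
    pop    %rbp             # 5d
    ret                     # c3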

Update (extra)

Even with -Os, it still results in the same code. So I think there must be a rational reason to avoid mov %rbp, %rsp.

ammarfaizi2@integral:/tmp$ clang -Wall -Wextra -c -Os -fno-omit-frame-pointer test.c -o test.o
ammarfaizi2@integral:/tmp$ objdump -d test.o

test.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <emit_mov_rbp_to_rsp>:
   0:   48 89 ec                mov    %rbp,%rsp

0000000000000003 <a_func>:
   3:   55                      push   %rbp
   4:   48 89 e5                mov    %rsp,%rbp
   7:   48 81 ec 00 10 00 00    sub    $0x1000,%rsp
   e:   48 8d bd 00 f0 ff ff    lea    -0x1000(%rbp),%rdi
  15:   e8 00 00 00 00          call   1a <a_func+0x17>
  1a:   48 81 c4 00 10 00 00    add    $0x1000,%rsp
  21:   5d                      pop    %rbp
  22:   c3                      ret    
ammarfaizi2@integral:/tmp$ 

Solution

  • If it's using RBP as a frame pointer at all, yes, mov %rbp, %rsp would be more compact and AFAIK at least as fast on all x86 microarchitectures (mov-elimination probably even works on it). Even more so when the add constant doesn't fit in an imm8.
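
    For a sense of the encoding sizes involved (standard x86-64 encodings; the small-frame add is a hypothetical example, not from the compile above):

        add    $0x18,   %rsp    # 48 83 c4 18            (4 bytes: sign-extended imm8)
        add    $0x1000, %rsp    # 48 81 c4 00 10 00 00   (7 bytes: imm32)
        mov    %rbp,    %rsp    # 48 89 ec               (3 bytes, regardless of frame size)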

    This is probably a missed optimization, very similar to https://bugs.llvm.org/show_bug.cgi?id=10319 (which proposes using leave instead of mov/pop; that would cost 1 extra uop on Intel but save another 3 bytes). That bug points out that the overall static code-size savings are pretty small in normal cases, but it isn't considering efficiency benefits. In normal builds (-O2 without -fno-omit-frame-pointer), only a few functions will use a frame pointer at all (only when using VLA / alloca, or when over-aligning the stack), so the possible benefit is even smaller.
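
    For reference, the leave alternative mentioned in that bug (a sketch; byte counts from standard x86-64 encodings, uop cost per the bug discussion above):

        mov    %rbp, %rsp       # 48 89 ec  \  4 bytes, 2 instructions
        pop    %rbp             # 5d        /

        leave                   # c9           1 byte, but 1 extra uop on Intel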

    It seems from that bug that it's just a peephole LLVM doesn't bother to look for, because many functions also need to restore other registers, so the epilogue actually needs RSP to end up pointing below those other pushes, not right at the saved RBP.
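
    A sketch of that more common case (a hypothetical function that also saves %r14 and %rbx; mov %rbp, %rsp alone would skip past them):

        push   %rbp
        mov    %rsp, %rbp
        push   %r14
        push   %rbx
        sub    $0x1000, %rsp
        # ... function body ...
        add    $0x1000, %rsp    # or lea -0x10(%rbp), %rsp; mov %rbp, %rsp would
        pop    %rbx             # leave RSP above the saved %rbx/%r14
        pop    %r14
        pop    %rbp
        ret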

    (GCC sometimes uses mov to restore call-preserved regs so it can use leave. With a frame pointer, the addressing mode is fairly compact to encode, although a 4-byte qword mov -8(%rbp), %r12 is still not as small as a 2-byte pop, of course. And without a frame pointer (e.g. in normal -O2 code), mov %rbp, %rsp was never an option anyway.)
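
    A sketch of that GCC-style epilogue (hypothetical codegen, not the output of an actual GCC run):

        push   %rbp
        mov    %rsp, %rbp
        push   %r12
        sub    $0x1000, %rsp
        # ... function body ...
        mov    -8(%rbp), %r12   # 4c 8b 65 f8   restore %r12 with a mov so RSP
                                #               doesn't have to point at it first
        leave                   # c9            mov %rbp, %rsp + pop %rbp
        ret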


    Before settling on that "not worth looking for" explanation, I had thought of another minor benefit of the add:

    After calling a function that saves/restores RBP, RBP is a load result. So after mov %rbp, %rsp, future use of RSP would need to wait for that load. Possibly some corner cases end up bottlenecked on store-forwarding latency, vs. register modification just being 1 cycle.

    But that seems unlikely to be worth the extra code size in general; I expect such corner cases are rare. Although that new RSP value is needed for the following pop %rbp, so the caller's restored RBP value would then be the result of a chain of two loads after we return. (Fortunately ret has branch prediction to hide latency.)
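
    A sketch of that dependency chain, annotating the epilogue of a_func above:

        # as emitted:
        call   use_buffer       # if use_buffer saved/restored RBP, our RBP value
                                # was just reloaded from the stack (a load result)
        add    $0x1000, %rsp    # depends only on RSP, not on that load
        pop    %rbp
        ret

        # hypothetical mov variant:
        call   use_buffer
        mov    %rbp, %rsp       # RSP now has to wait for the reloaded RBP...
        pop    %rbp             # ...and this load's address depends on that RSP
        ret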

    So it might be worth trying both ways in some benchmarks; e.g. comparing this vs. a tweaked version of LLVM on some standard benchmarks like SPECint.