I have this simple code in c
#include <stdio.h>
#include <alloca.h>
int main()
{
char* buffer = (char*)alloca(600);
snprintf(buffer, 600, "Hello %d %d %d\n", 1, 2, 3);
return 0;
}
I would expect that generated assembly code for alloca function would just decrement stack pointer(one sub instruction), and maybe do some alignments (one and instruction), but the resulting assembly code is very complicated and even more inefficient than you'd expect.
This is the output of objdump -d main.o
, on the output of gcc -c
(with no optimization, so the default -O0
)
0000000000400596 <main>:
400596: 55 push %rbp
400597: 48 89 e5 mov %rsp,%rbp
40059a: 48 83 ec 10 sub $0x10,%rsp
40059e: b8 10 00 00 00 mov $0x10,%eax
4005a3: 48 83 e8 01 sub $0x1,%rax
4005a7: 48 05 60 02 00 00 add $0x260,%rax
4005ad: b9 10 00 00 00 mov $0x10,%ecx
4005b2: ba 00 00 00 00 mov $0x0,%edx
4005b7: 48 f7 f1 div %rcx
4005ba: 48 6b c0 10 imul $0x10,%rax,%rax
4005be: 48 29 c4 sub %rax,%rsp
4005c1: 48 89 e0 mov %rsp,%rax
4005c4: 48 83 c0 0f add $0xf,%rax
4005c8: 48 c1 e8 04 shr $0x4,%rax
4005cc: 48 c1 e0 04 shl $0x4,%rax
4005d0: 48 89 45 f8 mov %rax,-0x8(%rbp)
4005d4: 48 8b 45 f8 mov -0x8(%rbp),%rax
4005d8: 41 b9 03 00 00 00 mov $0x3,%r9d
4005de: 41 b8 02 00 00 00 mov $0x2,%r8d
4005e4: b9 01 00 00 00 mov $0x1,%ecx
4005e9: ba a8 06 40 00 mov $0x4006a8,%edx
4005ee: be 58 02 00 00 mov $0x258,%esi
4005f3: 48 89 c7 mov %rax,%rdi
4005f6: b8 00 00 00 00 mov $0x0,%eax
4005fb: e8 a0 fe ff ff callq 4004a0 <snprintf@plt>
400600: b8 00 00 00 00 mov $0x0,%eax
400605: c9 leaveq
400606: c3 retq
400607: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1)
40060e: 00 00
Any idea what is the aim of this generated assembly code? I'm using gcc 8.3.1.
There is of course the usual debug-mode / anti-optimized behaviour of compiling each C statement to a separate block, with non-register
variables actually in memory. (Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?).
But yes, this goes beyond "not optimized". No sane person would expect GCC's canned sequence of instructions (or GIMPLE or RTL logic, whatever stage it's expanded) for alloca
logic to involve a div
by compile-time-constant power of 2, instead of a shift or just an AND. x /= 16;
doesn't compile to a div
if you write that yourself in C source, even with gcc -O0
.
Normally GCC does compile-time evaluation of constant expressions as much as possible, like x = 5 * 6
won't use imul at runtime. But the point at which it expands its alloca
logic must be after that point, probably pretty late (after most other passes) to explain all those missed optimizations. So it doesn't benefit from the same passes that operate on your C source logic.
It's doing 2 things:
round the allocation size up (a constant 600
after it puts that in a register) to a multiple of 16 by doing: ((16ULL - 1) + x) / 16 * 16
. A sane compiler would at least use right/left shift, if not optimize that to (x+15) & -16
. But unfortunately GCC uses div
and imul
by 16, even though it's a constant power of 2.
Round the final address of the allocated space to a multiple of 16 (even though it already was because RSP started 16-byte aligned and the allocation size was rounded up.) It does this with ((p+15) >> 4) << 4
which is much more efficient than div/imul (especially for 64-bit operand-size on Intel before Ice Lake), but still less efficient than and $-16, %rax
. And of course silly to do work that was already pointless.
Then of course it has to store the pointer into char* buffer
.
And in the block of asm for the next statement, reload it as an arg for sprintf
(inefficiently into RAX instead of directly into RDI, typical for gcc -O0
), along with setting up the register args.
So this sucks a lot, but is very plausibly explained by late expansion of the canned logic for alloca
, after most transformation ("optimization") passes have already run. Note that -O0
doesn't literally mean "no optimization", it just means "compile fast, and give consistent debugging".
Related:
How does gcc choose to number temporary variables from -fverbose-asm? - another discussion of that -O0
alloca asm, with the same guess about expanding it late in GIMPLE passes, or even in RTL. Also has optimized asm for alloca / snprintf which is vastly simpler. In fact that's almost a duplicate; that question did also ask about the alloca code.
doing seemingly un-needed ops (crackme) - I very lightly commented basically the same asm (for 32-bit mode), but mostly it's discussing hand-obfuscated asm.
How does GCC implement variable-length arrays? shows the 32-bit version of this bad code, but doesn't comment on how much it sucks.