Given this simple C program:
#define _XOPEN_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <alloca.h>
int main(){
    char *buf = alloca(600);
    snprintf(buf,600,"hi!, %d, %d, %d\n", 1,2,3);
    puts(buf);
}
Compiling it with cc -S -fverbose-asm a.c generates:
.text
.section .rodata
.LC0:
.string "hi!, %d, %d, %d\n"
.text
.globl main
.type main, @function
main:
pushq %rbp #
movq %rsp, %rbp #,
subq $16, %rsp #,
# a.c:7: char *buf = alloca(600);
movl $16, %eax #, tmp102
subq $1, %rax #, tmp89
addq $608, %rax #, tmp90
movl $16, %ecx #, tmp103
movl $0, %edx #, tmp93
divq %rcx # tmp103
imulq $16, %rax, %rax #, tmp92, tmp94
subq %rax, %rsp # tmp94,
movq %rsp, %rax #, tmp95
addq $15, %rax #, tmp96
shrq $4, %rax #, tmp97
salq $4, %rax #, tmp98
movq %rax, -8(%rbp) # tmp98, buf
# a.c:8: snprintf(buf,600,"hi!, %d, %d, %d\n", 1,2,3);
...
On what basis does GCC number those temporary variables (tmp102, tmp89, tmp90, ...)?
Also, can someone explain why alloca uses %rax (addq $608, %rax) for the allocated memory instead of %rsp (subq $608, %rsp)? That is what alloca is for, according to the man page:
The alloca() function allocates size bytes of space in the stack frame of the caller.
How can the intermediate representation have variables when the majority of them are immediates?
In an SSA (Static Single Assignment) internal representation of the program logic (like GCC's GIMPLE), every temporary value has a separate name. I'd assume the numbers come from auto-numbered SSA variables when there isn't a C variable name directly associated. But I'm not familiar with GCC internals enough to give any more details. If you're really curious, you could always look through the GCC source code yourself. But I'm fairly confident that auto-numbered SSA vars explains it, and makes total sense.
Numeric literals don't actually get any name with -fverbose-asm. E.g. in the optimized GCC output (from Godbolt) we see this as part of putting args in registers:
...
movl $3, %r9d #,
movl $2, %r8d #,
xorl %eax, %eax #
...
Re: alloca: it is eventually offsetting RSP, with subq %rax, %rsp, after rounding the allocation size up to a multiple of 16.
This rounding maintains stack alignment. (Please at least try to google it yourself. When you're missing a lot of background knowledge and concepts, you can't expect answers to fully explain everything from the ground up. When you don't understand the details of something, start by searching on technical terms that get used.)
BTW, that's amazingly inefficient asm from gcc -O0! It seems to be using x / 16 * 16 instead of x & 0xFFFF...F0 as part of rounding the allocation size up to a multiple of 16. (If you single-step with a debugger, you can see that the div and imul sequence is doing that.)
I guess the canned sequence of logic for the builtin function was written that way for some reason, and at -O0 GCC didn't do constant propagation through it. But anyway, that's why it's using RAX.
Perhaps the alloca logic is written in GIMPLE, or maybe RTL code that doesn't get expanded until after some transformation passes. That would explain why it's optimized so poorly even though it's all part of a single statement. gcc -O0 is always bad for performance, but a 64-bit div to divide by 16 is especially bad compared to a very cheap and with an immediate operand. It's also very strange to see a multiply by a power of 2 as an immediate operand in asm; in normal cases the compiler would optimize that into a shift.
To see non-terrible asm, look at what happens with optimization enabled, e.g. on Godbolt. See also How to remove "noise" from GCC/clang assembly output?. Then it does just sub $616, %rsp. But then it wastes instructions at runtime aligning a pointer into that space (to guarantee the space will be 16-byte aligned), even though RSP's alignment is statically known after that.
# GCC10.1 -O3 -fverbose-asm with alloca
...
subq $616, %rsp # reserve 600 + 16 bytes
leaq 15(%rsp), %r12
andq $-16, %r12 # get a 16-byte aligned pointer into it
movq %r12, %rdi # save the pointer for later instead of recalc before next call
call snprintf #
Silly compiler, the alignment of %rsp is statically known at that point, no (x+15) & -16 needed. Note that -16 = 0xFFFFFFFFFFFFFFF0 in 64-bit 2's complement, so it's a handy way to express AND masks that clear some low bits.
Removing alloca and using a plain local array gives even simpler code:
# GCC10.1 -O3 with char buf[600]
subq $616, %rsp
...
movq %rsp, %rdi
...
call snprintf #