assembly gcc compiler-construction x86-64

How does gcc choose to number temporary variables from -fverbose-asm?

Given this simple C program:

#define _XOPEN_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <alloca.h>

int main(){
    char *buf = alloca(600);
    snprintf(buf,600,"hi!, %d, %d, %d\n", 1,2,3);
    puts(buf);
}

Compiling it with $ cc -S -fverbose-asm a.c generates:

    .text
    .section    .rodata
.LC0:
    .string "hi!, %d, %d, %d\n"
    .text
    .globl  main
    .type   main, @function
main:
    pushq   %rbp    #
    movq    %rsp, %rbp  #,
    subq    $16, %rsp   #,
# a.c:7:    char *buf = alloca(600);
    movl    $16, %eax   #, tmp102
    subq    $1, %rax    #, tmp89
    addq    $608, %rax  #, tmp90
    movl    $16, %ecx   #, tmp103
    movl    $0, %edx    #, tmp93
    divq    %rcx    # tmp103
    imulq   $16, %rax, %rax #, tmp92, tmp94
    subq    %rax, %rsp  # tmp94,
    movq    %rsp, %rax  #, tmp95
    addq    $15, %rax   #, tmp96
    shrq    $4, %rax    #, tmp97
    salq    $4, %rax    #, tmp98
    movq    %rax, -8(%rbp)  # tmp98, buf
# a.c:8:    snprintf(buf,600,"hi!, %d, %d, %d\n", 1,2,3);
   ...

On what basis does gcc number those temporary variables (tmp102, tmp89, tmp90, ...)?

Also, can someone explain why alloca uses %rax (addq $608, %rax) for the allocated memory instead of %rsp (subq $608, %rsp)? The latter is what alloca is for, according to the man page: "The alloca() function allocates size bytes of space in the stack frame of the caller."

Solution

How can there be variables in the intermediate representation when the majority of them are immediates?

In an SSA (Static Single Assignment) internal representation of the program logic (like GCC's GIMPLE), every temporary value has a separate name. I'd assume the numbers come from auto-numbered SSA variables when there isn't a C variable name directly associated. But I'm not familiar with GCC internals enough to give any more details. If you're really curious, you could always look through the GCC source code yourself. But I'm fairly confident that auto-numbered SSA vars explains it, and makes total sense.
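As a hand-written illustration of that idea (not GCC's actual GIMPLE or RTL), the alloca size computation from the question can be written in C so that every intermediate value gets its own name and is assigned exactly once. The function name is hypothetical, and pairing names like t102 with the tmp102 in the asm comments is just my reading of those comments.

```c
#include <stdint.h>

/* Hypothetical helper: each instruction's result gets its own
   single-assignment name, mirroring the tmpNN comments in the -O0 asm. */
uint64_t alloca_size_single_assignment(void)
{
    const uint64_t t102 = 16;          /* movl  $16, %eax           */
    const uint64_t t89  = t102 - 1;    /* subq  $1, %rax            */
    const uint64_t t90  = t89 + 608;   /* addq  $608, %rax          */
    const uint64_t t103 = 16;          /* movl  $16, %ecx           */
    const uint64_t t92  = t90 / t103;  /* divq  %rcx (quotient)     */
    const uint64_t t94  = t92 * 16;    /* imulq $16, %rax, %rax     */
    return t94;                        /* later: subq %rax, %rsp    */
}
```

Working through the arithmetic by hand, 15 + 608 = 623, 623 / 16 = 38, and 38 * 16 = 608, so the value subtracted from RSP is 608.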

Numeric literals don't actually get any name with -fverbose-asm. e.g. in the optimized GCC output (from Godbolt) we see this as part of putting args in registers:

...
        movl    $3, %r9d        #,
        movl    $2, %r8d        #,
        xorl    %eax, %eax      #
...

Re: alloca: it does eventually offset RSP, with subq %rax, %rsp, after rounding the allocation size up to a multiple of 16.

This rounding maintains stack alignment. (Please at least try to google it yourself. When you're missing a lot of background knowledge and concepts, you can't expect answers to fully explain everything from the ground up. When you don't understand the details of something, start by searching on technical terms that get used.)

BTW, that's amazingly inefficient asm from gcc -O0! It seems to be using x / 16 * 16 instead of x & 0xFFFF...F0 as part of rounding the allocation size up to a multiple of 16. (If you single-step with a debugger, you can see that the div and imul sequence is doing that.)

I guess the canned sequence of logic for the builtin function was written that way for some reason, and at -O0 GCC didn't do constant propagation through it. But anyway, that's why it's using RAX.

Perhaps the alloca logic is written in GIMPLE, or maybe RTL code that doesn't get expanded until after some transformation passes. That would explain why it's optimized so poorly even though it's all part of a single statement. gcc -O0 is bad for performance in general, but a 64-bit div to divide by 16 is especially bad compared to a very cheap AND with an immediate operand. It's also very strange to see a multiply by a power of 2 as an immediate operand in asm; in normal cases the compiler would optimize that into a shift.
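To make the equivalence concrete, here is a minimal C sketch of three ways to round an unsigned value up to a multiple of 16: divide then multiply (what the -O0 code does for the size), a single AND with a mask, and a shift pair (like the shrq $4 / salq $4 applied to the pointer). The function names are mine, and it assumes x + 15 doesn't wrap around.

```c
#include <stdint.h>

/* Round up the expensive way: 64-bit divide, then multiply. */
uint64_t round_up16_divmul(uint64_t x)
{
    return (x + 15) / 16 * 16;
}

/* Same result with one AND against a mask that clears the low 4 bits. */
uint64_t round_up16_mask(uint64_t x)
{
    return (x + 15) & ~(uint64_t)15;
}

/* Same result with a right shift followed by a left shift. */
uint64_t round_up16_shift(uint64_t x)
{
    return ((x + 15) >> 4) << 4;
}
```

All three agree for unsigned inputs, which is why an optimizing compiler is free to pick the cheapest one.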

To see non-terrible asm, look at what happens with optimization enabled, e.g. on Godbolt. See also How to remove "noise" from GCC/clang assembly output?. With optimization, GCC just does sub $616, %rsp. But then it wastes instructions at runtime aligning a pointer into that space (to guarantee the space will be 16-byte aligned), even though RSP's alignment is statically known after that.

# GCC10.1 -O3 -fverbose-asm with alloca
...
        subq    $616, %rsp           # reserve 600 + 16 bytes
        leaq    15(%rsp), %r12
        andq    $-16, %r12           # get a 16-byte aligned pointer into it
        movq    %r12, %rdi           # save the pointer for later instead of recalc before next call
        call    snprintf        #

Silly compiler, the alignment of %rsp is statically known at that point, no (x+15) & -16 needed. Note that -16 = 0xFFFFFFFFFFFFFFF0 in 64-bit 2's complement, so it's a handy way to express AND masks that clear some low bits.
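A small C sketch of that (x+15) & -16 idiom, in case the mask looks like magic. The helper names are hypothetical; the second function mirrors what leaq 15(%rsp), %r12 / andq $-16, %r12 computes.

```c
#include <stdint.h>

/* -16 converted to a 64-bit unsigned type: all ones except the low 4 bits. */
uint64_t align_mask16(void)
{
    return (uint64_t)-16;
}

/* Hypothetical helper: bump an address up to the next 16-byte boundary.
   An address that is already 16-byte aligned comes back unchanged. */
uintptr_t align_up16(uintptr_t addr)
{
    return (addr + 15) & (uintptr_t)-16;
}
```

The second property is the reason the lea/and pair is wasted work when the input is already known to be aligned: align_up16(0x1000) is just 0x1000.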

Removing alloca and using a plain local array gives even simpler code:

# GCC10.1 -O3 with char buf[600]
        subq    $616, %rsp
...
        movq    %rsp, %rdi
...
        call    snprintf        #