Understanding GCC's alloca() alignment and seemingly missed optimization

Consider the following toy example that allocates memory on the stack by means of the alloca() function:

#include <alloca.h>

void foo() {
    volatile int *p = alloca(4);
    *p = 7;
}

Compiling the function above using gcc 8.2 with -O3 results in the following assembly code:

foo:
   pushq   %rbp
   movq    %rsp, %rbp
   subq    $16, %rsp
   leaq    15(%rsp), %rax
   andq    $-16, %rax
   movl    $7, (%rax)
   leave
   ret

Honestly, I would have expected a more compact assembly code.

16-byte alignment for allocated memory

The instruction andq $-16, %rax in the code above results in rax containing the (only) 16-byte-aligned address between the addresses rsp and rsp + 15 (both inclusive).

This alignment enforcement is the first thing I don't understand: Why does alloca() align the allocated memory to a 16-byte boundary?

Possible missed optimization?

Let's consider anyway that we want the memory allocated by alloca() to be 16-byte aligned. Even so, in the assembly code above, keeping in mind that GCC assumes the stack to be aligned to a 16-byte boundary at the moment of performing the function call (i.e., call foo), if we pay attention to the status of the stack inside foo() just after pushing the rbp register:

Size          Stack          RSP mod 16      Description
-----------------------------------------------------------------------------------
        ------------------
        |       .        |
        |       .        | 
        |       .        |            
        ------------------........0          at "call foo" (stack 16-byte aligned)
8 bytes | return address |
        ------------------........8          at foo entry
8 bytes |   saved RBP    |
        ------------------........0  <-----  RSP is 16-byte aligned!!!

I think that by taking advantage of the red zone (i.e., no need to modify rsp) and the fact that rsp already contains a 16-byte aligned address, the following code could be used instead:

foo:
   pushq   %rbp
   movq    %rsp, %rbp
   movl    $7, -16(%rbp)
   leave
   ret

The address contained in the register rbp is 16-byte aligned, therefore rbp - 16 will also be aligned to a 16-byte boundary.

Even better, the creation of the new stack frame can be optimized away, since rsp is not modified:

foo:
   movl    $7, -8(%rsp)
   ret

Is this just a missed optimization or I am missing something else here?

Solution

The x86-64 System V ABI requires VLAs (C99 Variable Length Arrays) to be 16-byte aligned, same for automatic / static arrays that are >= 16 bytes.

It looks like gcc is treating alloca as a VLA, and failing to do constant-propagation into an alloca that only runs once per function call. (Or that it internally uses alloca for VLAs.)

A generic alloca / VLA can't use the red-zone, in case the runtime value is larger than 128 bytes. GCC also makes a stack frame with RBP instead of saving the allocation size and doing an add rsp, rdx later.

So the asm looks exactly like what it would if the size was a function arg or other runtime variable instead of a constant. That's what led me to this conclusion.

Also alignof(maxalign_t) == 16 , but alloca and malloc can satisfy the requirement to return memory usable for any object without 16-byte alignment for objects smaller than 16 bytes. None of the standard types have alignment requirements wider than their size in x86-64 SysV.

You're right, it should be able to optimize it to this:

void foo() {
    alignas(16) int dummy[1];
    volatile int *p = dummy;   // alloca(4)
    *p = 7;
}

and compile it to the movl $7, -8(%rsp) ; ret you suggested.

The alignas(16) might be optional here for alloca.

If you really need gcc to emit better code when constant propagation makes the arg to alloca a compile-time constant, you could consider simply using a VLA in the first place. GNU C++ supports C99-style VLAs in C++ mode, but ISO C++ (and MSVC) don't.

Or possibly use if(__builtin_constant_p(size)) { VLA version } else { alloca version }, but scoping of VLAs means you can't return a VLA from the scope of an if that detects that we're being inlined with a compile-time constant size. So you'd have to duplicate the code that needs the pointer.