gcc 8.2+ doesn't always align the stack before a call on x86?

The current (Linux) version of the SysV i386 ABI requires 16-byte stack alignment before a call:

The end of the input argument area shall be aligned on a 16 (32, if __m256 is passed on stack) byte boundary. In other words, the value (%esp + 4) is always a multiple of 16 (32) when control is transferred to the function entry point.

On GCC 8.1 this code aligns the stack to 16-byte boundary prior to the call to callee: (Godbolt)

source	# bytes
call	4
push ebp	4
sub esp, 24	24
sub esp, 4	4
push eax	4
push eax	4
push eax	4
Total	48

On all versions of GCC 8.2 and later, it aligns to a 4-byte boundary: (Godbolt)

source	# bytes
call	4
push ebp	4
sub esp, 16	16
push eax	4
push eax	4
push eax	4
Total	36

Easily verifiable if we shorten or raise the number of parameters required by callee.

Changing -mprefered-stack-boundary bizarrely changes the operand to the sub instruction, but does nothing to change the actual stack alignment: (Godbolt)

So, uh, what gives?

Solution

Since you provided a definition of the function in the same translation unit, apparently GCC sees that the function doesn't care about stack alignment and doesn't bother much with it. And apparently this basic inter-procedural analysis / optimization (IPA) is on by default even at -O0.

Turns out this option even has an obvious name when I searched for "ipa" options in the manual: -fipa-stack-alignment is on by default even at -O0. Manually turning it off with -fno-ipa-stack-alignment results in what you expected, a second sub whose value depends on the number of pushes (Godbolt), making sure ESP is aligned by 16 before a call like modern Linux versions of the i386 SysV ABI use.

Or if you change the definition to just a declaration, then the resulting asm is as expected, fully respecting -mpreferred-stack-boundary.

void callee(void* a, void* b) {
}

void callee(void* a, void* b);

Using -fPIC also forces GCC to not assume anything about the callee, so it does respect the possibility of function interposition (e.g. via LD_PRELOAD) with the appropriate option.

Without compiling for a shared library, GCC is allowed to assume that any definition it sees for a global function is the definition, thanks to ISO C's one-definition-rule.

If you use __attribute__((noipa)) on the function definition, then call sites won't assume anything based on the definition. Just like if you'd renamed the definition (so you could still look at it) and provided only a declaration of the name the caller uses.

If you just want to stop inlining, you can use __attribute__((noinline,noclone)) instead, to still allow the callsite to be like it would if the optimizer simply chose not to inline, but could still see this definition. That may or may not be what you want.

See also How to remove "noise" from GCC/clang assembly output? re: writing functions whose asm is interesting to look at, and compiler options.

And BTW, I found it easiest to change the declaration / definition to variadic, so I could add or remove args with only a change to the caller. I was still able to reproduce your result of that not changing the sub amount even when the push amount changes with an extra arg, when there's a definition, but not with just a declaration.

void callee(void* a, ...)  // {}   // comment out a body or not
;