assembly x86 32bit-64bit sse memory-alignment

Stack alignment when using SIMD instructions

In the book about assembly that I am reading, we are told that any function we write that calls other functions (a non-leaf function) must maintain stack alignment. This is done so that SIMD instructions can be used by the functions our own function calls.

So far I've been told for x86 we must keep a 16-byte stack alignment for SIMD instructions. Is it always 16 bytes for all x86 programs, 32-bit and 64-bit, that are using SIMD?
Does it change based on the x86 Operating System we are building the program for?

Solution

Functions can't know what other functions will do internally. What really matters for linking libraries together and into executables is that they agree on a calling convention / ABI; that ABI sets requirements for callers, which in turn produce guarantees for callees about stack alignment (and other things). So it's not "when using SIMD instructions", unless you mean "in case any callee actually does depend on the ABI guarantee, e.g. by using a SIMD load or store on its stack space". For a real-world example, see "glibc scanf Segmentation faults when called from a function that doesn't align RSP".

See "Why does the x86-64 / AMD64 System V ABI mandate a 16 byte stack alignment?" for more details about some of the things I mention in this answer.

64-bit mode: always aligned by 16. Both the x86-64 System V and Windows x64 ABIs require RSP % 16 == 0 before a call, and thus guarantee RSP % 16 == 8 on function entry (the call instruction pushes an 8-byte return address). That's sufficient for 16-byte vectors, but functions that want alignas(32) or higher for locals still need to realign the stack themselves.
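As a rough illustration of that last point, here is a minimal C11 sketch of a function asking for a 32-byte-aligned local. The function name is made up for this example. With GCC or Clang targeting x86, the compiler is expected to emit the extra realignment code on its own, so the check should hold no matter how the caller's stack pointer was aligned beyond the ABI minimum.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns the address of an over-aligned local
 * modulo 32.  The ABI only promises 16-byte alignment, so the compiler
 * has to realign the stack inside this function to honor alignas(32). */
uintptr_t overaligned_local_mod32(void)
{
    alignas(32) unsigned char buf[64];

    memset(buf, 0, sizeof buf);            /* touch buf so it really lives on the stack */

    uintptr_t remainder = (uintptr_t)buf % 32;
    assert(remainder == 0);                /* 32-byte aligned despite the 16-byte ABI guarantee */
    return remainder;
}
```

Looking at the compiler's asm output for a function like this is a quick way to see the frame-pointer setup and the stack-pointer masking that the ABI guarantee alone doesn't make unnecessary.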

32-bit mode: 4-byte alignment on non-Linux. Only the version of the i386 System V ABI used on Linux requires 16-byte alignment (ESP % 16 == 0 before a call, ESP % 16 == 12 on function entry.) Even other OSes using the SysV ABI kept the old 4-byte alignment requirement, not adopting that change (e.g. *BSD, and maybe Mac OS X before it went 64-bit-only). And 32-bit code on Windows also only requires / guarantees 4-byte alignment.
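The function-entry values quoted for both modes are just the pre-call alignment minus the size of the return address that call pushes. A small C sketch of that arithmetic follows; the function and parameter names are invented for illustration, not taken from any ABI document.

```c
#include <assert.h>

/* Stack pointer modulo `boundary` at function entry, assuming the caller
 * had it aligned to `boundary` immediately before the call instruction
 * pushed a return address of `ret_size` bytes. */
unsigned entry_sp_mod(unsigned boundary, unsigned ret_size)
{
    assert(boundary != 0 && ret_size <= boundary);
    return (boundary - ret_size) % boundary;
}
```

For example, entry_sp_mod(16, 8) gives 8 (the 64-bit ABIs), entry_sp_mod(16, 4) gives 12 (i386 System V on Linux), and entry_sp_mod(4, 4) gives 0 (the older 4-byte rule).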

If you (or a compiler) want 16-byte aligned locals (e.g. to spill/reload an __m128), the function needs extra instructions. Typically that means setting up EBP as a frame pointer and then masking the stack pointer with the instruction and esp, -16, similar to what happens when allocating space for a VLA.
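What that masking instruction computes can be sketched in C as rounding an address down to a multiple of a power of two. The helper name here is an assumption for the example; the mask for 16 is the same bit pattern as -16.

```c
#include <assert.h>
#include <stdint.h>

/* Round `sp` down to a multiple of `alignment` (a power of two).
 * For alignment == 16 this applies the same mask as: and esp, -16 */
uintptr_t align_down(uintptr_t sp, uintptr_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return sp & ~(alignment - 1);
}
```

Because the stack grows downward, rounding down only moves the stack pointer further into unused space. The original value is lost, though, which is why the function saves it in EBP first. For instance, align_down(0x1003C, 16) yields 0x10030.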

The ABI requirement to maintain 16-byte stack alignment in all functions in 32-bit mode on GNU/Linux was an accident on GCC's part. With -mpreferred-stack-boundary=4 as the default, GCC assumed that alignment on function entry and generated code that faulted if called without it. By the time that bug was noticed, there were binaries in the wild that relied on it, including in major distros that change slowly, like Red Hat Enterprise Linux (RHEL). The least-bad way out of this situation was to change the ABI to require 16-byte alignment going forward, so -mpreferred-stack-boundary=4 became part of the ABI, rather than just the optimistic performance tweak that I think it was intended to be when it was made the default.

This change did in fact break hand-written asm that called C functions with a previously-allowed ESP alignment of less than 16. But binaries that assumed 16-byte alignment would likely have continued to be created by the defaults of the GCC versions that were widespread when this was noticed, so changing the ABI to match what released versions of GCC were actually doing was not great, but potentially less bad. Breakage in practice for old libraries with new executables would be limited to callback functions, or other ways for old code to call new code. (New code calling old code is fine, because a looser alignment requirement is satisfied by callers that provide 16-byte alignment.)

Other OSes avoided this ABI-change debacle that broke old binaries and hand-written asm.

See https://sourceforge.net/p/fbc/bugs/659/ for some history, and my comment on https://gcc.gnu.org/bugzilla/show_bug.cgi?id=40838#c91 for another attempt at summarizing the unfortunate history of how i386 GNU/Linux + GCC accidentally got into a situation where a backwards-incompatible change to the i386 System V ABI was the lesser of two evils.