Why does GCC create extra assembly instructions on my machine?

It has been a while since I started working with SSE/AVX intrinsic functions. I recently began writing a header for matrix transposition. I used a lot of if constexpr branches so that the compiler always selects the optimal instruction set depending on some template parameters. Now I wanted to check if everything works as expected by looking into the local disassembly with objdump. When using Clang, I get a clear output which basically contains only the assembly instructions corresponding to the utilized intrinsic functions. However, if I use GCC, the disassembly is quite bloated with extra instructions. A quick check on Godbolt shows me that those extra instructions in the GCC disassembly shouldn't be there.

Here is a small example:

#include <x86intrin.h>
#include <array>

std::array<__m256, 1> Test(std::array<__m256, 1> a)
{
    std::array<__m256, 1> b;

    b[0] = _mm256_unpacklo_ps(a[0], a[0]);
    return b;
}

I compile with -march=native -Wall -Wextra -Wpedantic -pthread -O3 -DNDEBUG -std=gnu++1z. Then I use objdump -S -Mintel libassembly.a > libassembly.dump on the object file. For Clang (6.0.0), the result is:

In archive libassembly.a:

libAssembly.cpp.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <_Z4TestSt5arrayIDv8_fLm1EE>:
   0:   c4 e3 7d 04 c0 50       vpermilps ymm0,ymm0,0x50
   6:   c3                      ret

which is the same as Godbolt returns: Godbolt - Clang 6.0.0

For GCC (7.4) the output is

In archive libassembly.a:

libAssembly.cpp.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <_Z4TestSt5arrayIDv8_fLm1EE>:
   0:   4c 8d 54 24 08          lea    r10,[rsp+0x8]
   5:   48 83 e4 e0             and    rsp,0xffffffffffffffe0
   9:   c5 fc 14 c0             vunpcklps ymm0,ymm0,ymm0
   d:   41 ff 72 f8             push   QWORD PTR [r10-0x8]
  11:   55                      push   rbp
  12:   48 89 e5                mov    rbp,rsp
  15:   41 52                   push   r10
  17:   48 83 ec 28             sub    rsp,0x28
  1b:   64 48 8b 04 25 28 00    mov    rax,QWORD PTR fs:0x28
  22:   00 00 
  24:   48 89 45 e8             mov    QWORD PTR [rbp-0x18],rax
  28:   31 c0                   xor    eax,eax
  2a:   48 8b 45 e8             mov    rax,QWORD PTR [rbp-0x18]
  2e:   64 48 33 04 25 28 00    xor    rax,QWORD PTR fs:0x28
  35:   00 00 
  37:   75 0c                   jne    45 <_Z4TestSt5arrayIDv8_fLm1EE+0x45>
  39:   48 83 c4 28             add    rsp,0x28
  3d:   41 5a                   pop    r10
  3f:   5d                      pop    rbp
  40:   49 8d 62 f8             lea    rsp,[r10-0x8]
  44:   c3                      ret    
  45:   c5 f8 77                vzeroupper 
  48:   e8 00 00 00 00          call   4d <_Z4TestSt5arrayIDv8_fLm1EE+0x4d>

As you can see, there are a lot of additional instructions. In contrast to that, Godbolt does not include all these extra instructions: Godbolt - GCC 7.4

So what is going on here? I have just started learning assembly, so maybe it is totally clear to someone with assembly experience, but I am a little bit confused why GCC creates those extra instructions on my machine.

Greetings and thank you in advance.

EDIT

To avoid further confusions, I just compiled using:

gcc-7 -I/usr/local/include -O3 -march=native -Wall -Wextra -Wpedantic -pthread -std=gnu++1z -o test.o -c /<PathToFolder>/libAssembly.cpp

Output remains the same. I am not sure if this is relevant, but it generates the warning: warning: ignoring attributes on template argument ‘__m256 {aka __vector(8) float}’ [-Wignored-attributes]

Usually I surpress this warning and it shouldn't be an issue:

Implication of GCC warning: ignoring attributes on template argument (-Wignored-attributes)

Processor is Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz

Here is the gcc -v:

gcc-7 -v
Using built-in specs.
COLLECT_GCC=gcc-7
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/7/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 7.4.0-1ubuntu1~18.04.1' --with-bugurl=file:///usr/share/doc/gcc-7/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++ --prefix=/usr --with-gcc-major-version-only --program-suffix=-7 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --enable-default-pie --with-system-zlib --with-target-system-zlib --enable-objc-gc=auto --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)

Solution

Use -fno-stack-protector

Your local GCC defaults to -fstack-protector-strong but Godbolt's GCC install doesn't.

mov rax,QWORD PTR fs:0x28 is the telltale clue; Thread-local storage at fs:40 aka fs:0x28 is where GCC keeps its stack cookie constant. The call after the ret is call __stack_chk_fail (but you disassembled a .o without using objdump -dr to show relocations, so the placeholder +0 offset just looked like still a target within this function).

Since you have arrays (or a class containing an array), stack-protector-strong kicks in even though their sizes are compile-time constants. So you get the code to store the stack cookie, then check it and branch on stack overflow. (Even the array of size 1 in this MVCE is enough to trigger that.)

Making arrays on the stack with 32-byte alignment (for __m256) requires 32-byte alignment, and your GCC is older than GCC8 so you get the ridiculously clunky stack-alignment code that builds a full copy of the stack frame including a return address. Generated assembly for extended alignment of stack variables (To be clear, GCC8 still does align the stack here, just wasting fewer instructions on it.)

This is pretty much a missed optimization; gcc never actually spills or reloads to those arrays so it could have just optimized them away, along with the stack alignment, like it did without stack-protector.

More recent GCC is better at optimizing away stack alignment after optimizing away the memory for aligned locals in more cases, but this has been a persistent missed optimization in AVX code. Fortunately the cost is pretty negligible in a function that loops; as long as small helper functions inline.

Compiling on Godbolt with -fstack-protector-strong reproduces your output. Newer GCC, including current trunk pre-10, still has both missed optimizations, but stack alignment costs fewer instructions because it just uses RBP as a frame pointer and aligns RSP, then references locals relative to aligned RSP. It still checks the stack cookie (with no instructions between storing it and checking it).

On your desktop, compiling with -fno-stack-protector should make good asm.