It has been a while since I started working with SSE/AVX intrinsic functions. I recently began writing a header for matrix transposition. I used a lot of if constexpr
branches so that the compiler always selects the optimal instruction set depending on some template parameters. Now I wanted to check if everything works as expected by looking into the local disassembly with objdump
. When using Clang, I get a clear output which basically contains only the assembly instructions corresponding to the utilized intrinsic functions. However, if I use GCC, the disassembly is quite bloated with extra instructions. A quick check on Godbolt shows me that those extra instructions in the GCC disassembly shouldn't be there.
Here is a small example:
#include <x86intrin.h>
#include <array>
std::array<__m256, 1> Test(std::array<__m256, 1> a)
{
std::array<__m256, 1> b;
b[0] = _mm256_unpacklo_ps(a[0], a[0]);
return b;
}
I compile with -march=native -Wall -Wextra -Wpedantic -pthread -O3 -DNDEBUG -std=gnu++1z
. Then I use objdump -S -Mintel libassembly.a > libassembly.dump
on the object file. For Clang (6.0.0), the result is:
In archive libassembly.a:
libAssembly.cpp.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <_Z4TestSt5arrayIDv8_fLm1EE>:
0: c4 e3 7d 04 c0 50 vpermilps ymm0,ymm0,0x50
6: c3 ret
which is the same as Godbolt returns: Godbolt - Clang 6.0.0
For GCC (7.4) the output is
In archive libassembly.a:
libAssembly.cpp.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <_Z4TestSt5arrayIDv8_fLm1EE>:
0: 4c 8d 54 24 08 lea r10,[rsp+0x8]
5: 48 83 e4 e0 and rsp,0xffffffffffffffe0
9: c5 fc 14 c0 vunpcklps ymm0,ymm0,ymm0
d: 41 ff 72 f8 push QWORD PTR [r10-0x8]
11: 55 push rbp
12: 48 89 e5 mov rbp,rsp
15: 41 52 push r10
17: 48 83 ec 28 sub rsp,0x28
1b: 64 48 8b 04 25 28 00 mov rax,QWORD PTR fs:0x28
22: 00 00
24: 48 89 45 e8 mov QWORD PTR [rbp-0x18],rax
28: 31 c0 xor eax,eax
2a: 48 8b 45 e8 mov rax,QWORD PTR [rbp-0x18]
2e: 64 48 33 04 25 28 00 xor rax,QWORD PTR fs:0x28
35: 00 00
37: 75 0c jne 45 <_Z4TestSt5arrayIDv8_fLm1EE+0x45>
39: 48 83 c4 28 add rsp,0x28
3d: 41 5a pop r10
3f: 5d pop rbp
40: 49 8d 62 f8 lea rsp,[r10-0x8]
44: c3 ret
45: c5 f8 77 vzeroupper
48: e8 00 00 00 00 call 4d <_Z4TestSt5arrayIDv8_fLm1EE+0x4d>
As you can see, there are a lot of additional instructions. In contrast to that, Godbolt does not include all these extra instructions: Godbolt - GCC 7.4
So what is going on here? I have just started learning assembly, so maybe it is totally clear to someone with assembly experience, but I am a little bit confused why GCC creates those extra instructions on my machine.
Greetings and thank you in advance.
EDIT
To avoid further confusions, I just compiled using:
gcc-7 -I/usr/local/include -O3 -march=native -Wall -Wextra -Wpedantic -pthread -std=gnu++1z -o test.o -c /<PathToFolder>/libAssembly.cpp
Output remains the same. I am not sure if this is relevant, but it generates the warning:
warning: ignoring attributes on template argument ‘__m256 {aka __vector(8) float}’ [-Wignored-attributes]
Usually I surpress this warning and it shouldn't be an issue:
Implication of GCC warning: ignoring attributes on template argument (-Wignored-attributes)
Processor is Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
Here is the gcc -v
:
gcc-7 -v
Using built-in specs.
COLLECT_GCC=gcc-7
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/7/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 7.4.0-1ubuntu1~18.04.1' --with-bugurl=file:///usr/share/doc/gcc-7/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++ --prefix=/usr --with-gcc-major-version-only --program-suffix=-7 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --enable-default-pie --with-system-zlib --with-target-system-zlib --enable-objc-gc=auto --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)
Use -fno-stack-protector
Your local GCC defaults to -fstack-protector-strong
but Godbolt's GCC install doesn't.
mov rax,QWORD PTR fs:0x28
is the telltale clue; Thread-local storage at fs:40
aka fs:0x28
is where GCC keeps its stack cookie constant. The call
after the ret
is call __stack_chk_fail
(but you disassembled a .o
without using objdump -dr
to show relocations, so the placeholder +0
offset just looked like still a target within this function).
Since you have arrays (or a class containing an array), stack-protector-strong kicks in even though their sizes are compile-time constants. So you get the code to store the stack cookie, then check it and branch on stack overflow. (Even the array of size 1 in this MVCE is enough to trigger that.)
Making arrays on the stack with 32-byte alignment (for __m256
) requires 32-byte alignment, and your GCC is older than GCC8 so you get the ridiculously clunky stack-alignment code that builds a full copy of the stack frame including a return address. Generated assembly for extended alignment of stack variables (To be clear, GCC8 still does align the stack here, just wasting fewer instructions on it.)
This is pretty much a missed optimization; gcc never actually spills or reloads to those arrays so it could have just optimized them away, along with the stack alignment, like it did without stack-protector.
More recent GCC is better at optimizing away stack alignment after optimizing away the memory for aligned locals in more cases, but this has been a persistent missed optimization in AVX code. Fortunately the cost is pretty negligible in a function that loops; as long as small helper functions inline.
Compiling on Godbolt with -fstack-protector-strong
reproduces your output. Newer GCC, including current trunk pre-10, still has both missed optimizations, but stack alignment costs fewer instructions because it just uses RBP as a frame pointer and aligns RSP, then references locals relative to aligned RSP. It still checks the stack cookie (with no instructions between storing it and checking it).
On your desktop, compiling with -fno-stack-protector
should make good asm.