Tags: c++, gcc, optimization, x86, memory-alignment

`movaps` vs. `movups` in GCC: how does it decide?


I recently researched a segfault in a piece of software compiled with GCC 8. The code looked as follows (this is just a sketch):

struct Point
{
  int64_t x, y;
};

struct Edge
{
  // some other fields
  // ...
  Point p; // <- at offset `0xC0`

  Edge(const Point &p) : p(p) {}
};

Edge *create_edge(const Point &p)
{
  void *raw_memory = my_custom_allocator(sizeof(Edge));
  return new (raw_memory) Edge(p);
}

The key point here is that my_custom_allocator() returns pointers to unaligned memory. The code crashes because, in order to copy the original Point p into the field Edge::p of the new object, the compiler used a movdqu/movaps pair in the [inlined] constructor code:

movdqu 0x0(%rbp), %xmm1  ; read the original object at `rbp`
...
movaps %xmm1, 0xc0(%rbx) ; store it into the new `Edge` object at `rbx` - crash!

At first, everything seems to be clear here: the memory is not properly aligned, movaps crashes. My fault.

But is it?

Attempting to reproduce the problem on Godbolt I observe that GCC 8 actually attempts to handle it fairly intelligently. When it is sure that the memory is properly aligned it uses movaps, just like in my code. This

#include <new>
#include <cstdlib>

struct P { unsigned long long x, y; };

unsigned char buffer[sizeof(P) * 100];

void *alloc()
{
  return buffer;
}

void foo(const P& s)
{
  void *raw = alloc();
  new (raw) P(s);
}

results in this

foo(P const&):
    movdqu  xmm0, XMMWORD PTR [rsi]
    movaps  XMMWORD PTR buffer[rip], xmm0
    ret

https://godbolt.org/z/a3uSid

But when it is not sure, it uses movups. E.g. if I "hide" the definition of the allocator in the above example, it will opt for movups in the same code

foo(P const&):
    push    rbx
    mov     rbx, rdi
    call    alloc()
    movdqu  xmm0, XMMWORD PTR [rbx]
    movups  XMMWORD PTR [rax], xmm0
    pop     rbx
    ret

https://godbolt.org/z/cNKe5A

So, if it is supposed to behave that way, why is it using movaps in the software I mentioned at the beginning of this post? In my case the implementation of my_custom_allocator() is not visible to the compiler at the point of the call, which is why I'd expect GCC to opt for movups.

What are the other factors that might be at play here? Is it a bug in GCC? How can I force GCC to use movups, preferably everywhere?


Solution

  • Update: alignof(Edge) was 16 because of long double on x86-64 System V, so it's UB to have one at a less-aligned address. This tells GCC it's safe to use movaps.

    IDK why loading it from (%rbp) didn't also use movaps. I thought that implied Edge wouldn't be 16-byte aligned, so there's a whole section of this answer based on that guess (which I moved to the end).


    Some types can require 16-byte alignment, notably long double

    alignof(max_align_t) == 16 on x86-64 System V. A drop-in replacement for malloc needs to return memory at least that aligned, for allocations of 16 bytes or larger.

    (Smaller allocations of course couldn't hold a 16-byte object and therefore can't require 16-byte alignment. You can ask for a specific instance of an object to be over-aligned with alignas(16) int foo;, but if a type itself has higher alignment it also has larger sizeof so an array will still obey the normal rules as well as having every element satisfy the alignment requirement.)

    See also Why does unaligned access to mmap'ed memory sometimes segfault on AMD64? where auto-vectorization with a misaligned uint16_t* leads to a segfault. Also Pascal Cuoq's blog about alignment and having objects with less alignment than alignof(T) is undefined behaviour, and how assumption of no UB runs deep for compilers.


    Instruction selection

    GCC and clang use movaps whenever they can prove that memory must be sufficiently aligned. (By assuming no UB). On Core2 and earlier, and K10 and earlier, unaligned store instructions are slow even if the memory happens to be aligned at runtime.

    Nehalem and Bulldozer changed that, but GCC still uses movaps even with -mtune=haswell, or even vmovaps with -march=haswell even though that can only execute on CPUs with cheap vmovups.

    MSVC and ICC never use movaps, hurting perf on very old CPUs but letting you get away with misaligning data sometimes. They will fold aligned loads into memory operands for SSE instructions like paddd xmm0, [rdi] (which requires alignment, unlike the AVX1 equivalent) so they will still make code that faults on misalignment sometimes, but usually only with optimization enabled. IMO that's not particularly great.


    alignof(Point) should only be 8 (inheriting the alignment of its most-aligned member, an int64_t). So GCC can only prove 8-byte alignment for an arbitrary Point, not 16.

    For static storage, GCC can know that it chose to align the array by 16 and thus can use movaps / movdqa to load from it. (Also, the x86-64 System V ABI requires that static arrays of 16 bytes or larger be aligned by 16, so GCC can assume this even for an extern unsigned char buffer[] global defined in some other compilation unit.)

    You haven't shown a definition for Edge so IDK why it has 16-byte alignment, but possibly alignof(Edge) == 16? Otherwise yes, that might be a compiler bug.

    But the fact that it loads the original Edge object from the stack with movups would seem to indicate that alignof(Edge) < 16.


    Possibly raw_memory = __builtin_assume_aligned(raw_memory, 8); could help? IDK if that can tell GCC to assume lower alignment than it already thought it could assume based on other factors.


    You could tell GCC that Edge (or int for that matter) can always be under-aligned by defining a typedef like this:

    typedef long __attribute__((aligned(1), may_alias)) unaligned_aliasing_long;
    

    may_alias is actually orthogonal to alignment, but it's worth mentioning because one of the use-cases for this would be loads out of a char[] buffer for parsing a byte stream. In that case you'd want both. That's an alternative to using memcpy(tmp, src, sizeof(tmp)); to do unaligned strict-aliasing-safe loads.

    GCC uses may_alias to define __m128, and may_alias,aligned(1) as part of defining _mm_loadu_ps (the intrinsic for unaligned SIMD loads like movups). (You don't need may_alias for loading a vector of float from a float array, but you do need may_alias for loading it from something else.) See also Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?

    And see Why does glibc's strlen need to be so complicated to run quickly? for scalar code that I think is safe for under-aligned / aliasing unsigned long, unlike glibc's fallback C implementation. (Which has to be compiled without -flto so it can't inline into other glibc functions and break because of strict-aliasing violation.)


    Allocators and assumed alignment

    (This section was written assuming that alignof(Edge) < 16. This was not the case here, and the function attributes might be useful to know about even though they're not the cause of the problem. And probably not a viable workaround either.)

    You might be able to use __attribute__ ((assume_aligned (8))) on your allocator to tell GCC about the alignment of the pointer it returns.

    GCC may possibly be assuming for some reason that your allocator returns memory usable for any object (and alignof(max_align_t) == 16 on x86-64 System V because of long double and other things, and also on Windows x64).

    If this is not the case, you may be able to tell it that. In this mmap mis-alignment Q&A, we can see that GCC does "know about" malloc and treat it specially. But GCC doing that for a function that doesn't have an ISO C or C++ defined name, or GNU C attributes, would be surprising. IDK, it's the best guess so far based on what you've shown, if it's not a compiler bug. (That is possible.)

    From the GCC manual:

    void* my_alloc1 (size_t) __attribute__((assume_aligned (16)));
    void* my_alloc2 (size_t) __attribute__((assume_aligned (32, 8)));
    

    declares that my_alloc1 returns 16-byte aligned pointers and that my_alloc2 returns a pointer whose value modulo 32 is equal to 8.

    I don't know why it would assume that a void* returned by a function and cast to another type would have any more alignment than the type of the object being constructed, though. We can see that it uses movups to load an Edge from somewhere, which would seem to indicate that alignof(Edge) < 16.

    Also relevant is __attribute__((alloc_size(1))) to tell GCC that the first arg to the function is a size. If your function takes an explicit alignment as an arg, use alloc_align (position) to indicate that, otherwise don't.