Tags: visual-studio, visual-c++, sse, intrinsics, memory-alignment

Is there a way to force Visual Studio to generate aligned instructions from SSE intrinsics?


The _mm_load_ps() SSE intrinsic is defined as an aligned load, raising an exception if the address is not 16-byte aligned. However, it seems Visual Studio generates an unaligned read instead.

Since not all compilers are made the same, this hides bugs. It would be nice to be able to turn the actual aligned operations on, even though the performance hit they used to carry doesn't seem to be there anymore.

In other words, writing code:

__m128 p1 = _mm_load_ps(data);

currently produces:

movups      xmm0,xmmword ptr [eax]

expected result:

movaps      xmm0,xmmword ptr [eax]

(I was asked by Microsoft to ask here.)
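
For example, a hypothetical repro of the kind of bug this hides (untested; uses MSVC's __declspec(align)):

#include <xmmintrin.h>
#include <stdio.h>

int main(void)
{
    __declspec(align(16)) float data[8] = { 0 };
    // data + 1 is only 4-byte aligned, so _mm_load_ps on it is invalid:
    // movaps would fault here, but if the compiler emits movups it silently works.
    __m128 v = _mm_load_ps(data + 1);

    float out[4];
    _mm_storeu_ps(out, v);
    printf("%f\n", out[0]);
    return 0;
}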


Solution

  • MSVC and ICC only use instructions that do alignment checking when they fold a load into a memory source operand without AVX enabled, like addps xmm0, [rax]. SSE memory source operands require alignment, unlike AVX. But you can't reliably control when this happens, and in debug builds it generally doesn't.
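
    For example (a sketch with made-up function names; the exact asm depends on compiler version and flags):

    #include <xmmintrin.h>

    // Optimized build, AVX not enabled: the load is typically folded into the ALU
    // instruction, e.g. addps xmm0, xmmword ptr [mem], and that SSE memory operand
    // does fault on a misaligned address.
    __m128 add_from_mem(const float *p) {
        __m128 v = _mm_set1_ps(1.0f);
        return _mm_add_ps(v, _mm_load_ps(p));
    }

    // Standalone load: MSVC/ICC typically emit movups here, so no alignment check.
    __m128 load_alone(const float *p) {
        return _mm_load_ps(p);
    }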

    As Mysticial points out in Visual Studio 2017: _mm_load_ps often compiled to movups, another case where an aligned instruction is used is NT load/store, because there is no unaligned version of those.


    If your code is compatible with clang-cl, have Visual Studio use it instead of MSVC. It's a modified version of clang that tries to act more like MSVC. But like GCC, clang uses aligned load and store instructions for aligned intrinsics.
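
    For instance (the toolset name here is an assumption to verify against your VS version), you can select the ClangCL platform toolset in the project properties, or set <PlatformToolset>ClangCL</PlatformToolset> in the .vcxproj, or pick it when configuring with CMake's Visual Studio generator:

    cmake -G "Visual Studio 17 2022" -T ClangCL ..
    cmake --build . --config Debug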

    Either disable optimization, or make sure AVX is not enabled; otherwise it could fold a _mm_load_ps into a memory source operand like vaddps xmm0, xmm0, [rax], which doesn't require alignment because it's the AVX version. This may be a problem if your code also uses AVX intrinsics in the same file, because clang requires that you enable ISA extensions for intrinsics you want to use; the compiler won't emit asm instructions for an extension that isn't enabled, even with intrinsics, unlike MSVC and ICC.

    A debug build should work even with AVX enabled, especially if you _mm_load_ps or _mm256_load_ps into a separate variable in a separate statement, rather than writing v = _mm_add_ps(v, _mm_load_ps(ptr));
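
    A minimal sketch of that pattern (assuming clang-cl, which keeps aligned intrinsics as aligned instructions):

    #include <xmmintrin.h>

    static __m128 v;   // accumulator, just for illustration

    // Debug build: each intrinsic becomes its own instruction, so the aligned
    // load stays visible as (v)movaps and can fault on a misaligned pointer.
    void accumulate(const float *ptr) {
        __m128 tmp = _mm_load_ps(ptr);   // separate variable, separate statement
        v = _mm_add_ps(v, tmp);
        // rather than: v = _mm_add_ps(v, _mm_load_ps(ptr));
    }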


    With MSVC itself, for debugging purposes only (there is usually a very big speed penalty for the stores), you could substitute NT (non-temporal) loads/stores for the normal ones. Since they're special, the compiler won't fold those loads into memory source operands for ALU instructions, so this can maybe work even with AVX and optimization enabled.

    // alignment_debug.h      (untested)
    // #include this *after* immintrin.h
    #ifdef DEBUG_SIMD_ALIGNMENT
     #pragma message("using slow alignment-debug SIMD instructions to work around MSVC/ICC limitations")   // MSVC has no #warning directive
       // SSE4.1 MOVNTDQA doesn't do anything special on normal WB memory, only WC
       // On WB, it's just a slower MOVDQA, wasting an ALU uop.
       // Cast to plain __m128i*: older intrinsics headers declare _mm_stream_load_si128
       // with a non-const pointer arg, so a const-qualified cast wouldn't compile there.
     #define _mm_load_si128(ptr)  _mm_stream_load_si128((__m128i*)(ptr))
     #define _mm_load_ps(ptr)     _mm_castsi128_ps(_mm_stream_load_si128((__m128i*)(ptr)))
     #define _mm_load_pd(ptr)     _mm_castsi128_pd(_mm_stream_load_si128((__m128i*)(ptr)))
    
      // SSE1/2 MOVNTPS / PD / MOVNTDQ  evict data from cache if it was hot, and bypass cache
     #define _mm_store_ps  _mm_stream_ps       // SSE1 movntps
     #define _mm_store_pd  _mm_stream_pd       // SSE2 movntpd is a waste of space vs. the ps encoding, but whatever
     #define _mm_store_si128 _mm_stream_si128  // SSE2 movntdq
    
    // and repeat for _mm256_... versions with _mm256_castsi256_ps
    // and _mm512_... versions 
    // edit welcome if anyone tests this and adds those versions
    #endif
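
    To use the header sketched above, include it after the intrinsics header and define the macro on the command line (file and macro names are from this sketch, untested):

    // foo.c  -- build the alignment-checking version with:  cl /O2 /DDEBUG_SIMD_ALIGNMENT foo.c
    #include <immintrin.h>
    #include "alignment_debug.h"   // no-op unless DEBUG_SIMD_ALIGNMENT is defined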
    

    Related: for auto-vectorization with MSVC (and gcc/clang), see Alex's answer on Alignment attribute to force aligned load/store in auto-vectorization of GCC/CLang