I am learning about SSE and AVX to further improve the performance of some of the computations in my code.
However, I have come across multiple different ways to use SSE instructions on an existing array of floats, and I would like to know which of these are safe (no UB) and efficient. I have marked the lines of code that differ across versions with a <-- arrow comment. I have also given a link to a godbolt example.
Version 1 :- Using _mm_load_ps
#include <immintrin.h>
#include <iostream>
int main()
{
    __m128 simd = _mm_set1_ps(10.0f);
    alignas(16) float float_arr[4] = {0, 1, 2, 3};
    __m128 load_simd = _mm_load_ps(float_arr); // <-------
    __m128 sum = _mm_add_ps(simd, load_simd);
    alignas(16) float float_arr_sum[4];
    _mm_store_ps(float_arr_sum, sum);
    std::cout << float_arr_sum[0] << ", " << float_arr_sum[1] << ", " << float_arr_sum[2] << ", " << float_arr_sum[3] << std::endl;
}
Version 2 :- Using __m128& (reference)
#include <immintrin.h>
#include <iostream>
int main()
{
    __m128 simd = _mm_set1_ps(10.0f);
    alignas(16) float float_arr[4] = {0, 1, 2, 3};
    __m128& cast_ref_simd = reinterpret_cast<__m128&>(float_arr[0]); // <-----
    __m128 sum = _mm_add_ps(simd, cast_ref_simd);
    alignas(16) float float_arr_sum[4];
    _mm_store_ps(float_arr_sum, sum);
    std::cout << float_arr_sum[0] << ", " << float_arr_sum[1] << ", " << float_arr_sum[2] << ", " << float_arr_sum[3] << std::endl;
}
Version 3 :- Using __m128* (pointer)
#include <immintrin.h>
#include <iostream>
int main()
{
    __m128 simd = _mm_set1_ps(10.0f);
    alignas(16) float float_arr[4] = {0, 1, 2, 3};
    __m128* cast_ptr_simd = reinterpret_cast<__m128*>(float_arr); // <-------
    __m128 sum = _mm_add_ps(simd, *cast_ptr_simd); // <-------
    alignas(16) float float_arr_sum[4];
    _mm_store_ps(float_arr_sum, sum);
    std::cout << float_arr_sum[0] << ", " << float_arr_sum[1] << ", " << float_arr_sum[2] << ", " << float_arr_sum[3] << std::endl;
}
I tested all these in Compiler Explorer and saw that, without optimization, versions 2 and 3 each have one extra movaps (move aligned packed single-precision floating-point) instruction between the xmm0 and xmm1 registers; versions 2 and 3 also generate the same assembly output as each other.
At optimization levels -O1 and above (-O1, -O2, -O3, ...), all three versions generate the same code.
For my project I am using -O3, so I guess the compiled output will not change, but I want to have a proper understanding if possible.
I am also looking at some popular extra libraries like Agner Fog's header-only library and Vc (which will become std::simd at some point), but I want to gain some low-level experience before using them.
All 3 versions are exactly equivalent to the compiler in this use-case, where you only read the reference / deref the pointer once, with no stores between the cast (or load) and that read.
See Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior? (No UB; it's safe.)
So the choice comes down to style and readability. _mm_load_ps, or _mm_loadu_ps for unaligned, are standard. The compiler can still fold a _mm_load_ps into a memory source operand for addps xmm0, [rdi] or whatever.
If you cast to a raw __m128 *p and deref with *p, that only works for aligned data (it's equivalent to _mm_load_ps, not loadu). If you wanted to modify your code to allow an unaligned input (e.g. a pointer to the middle of an array, which is something you might want even if all your arrays are aligned), you'd have to change your code significantly to use _mm_loadu_ps. That intrinsic wants a float*, so you'd actually have to cast back: _mm_loadu_ps( (float*)p ).
Some people like to increment a __m128i *ptr through their array for integer code, where the load[u] / store[u] intrinsics take __m128i* rather than void*. But most code still uses the load / store intrinsics instead of a raw deref, mostly just to make the memory access more visible.
Using a reference instead of a load seems like a terrible habit. Usually you want the compiler to load once, not keep re-reading memory every time you read a variable, even after stores to other locations. Using a reference forces it to do alias analysis (figure out that two arrays can't overlap, or that a specific store can't overlap with any __m128& references) if it wants to optimize the referenced __m128 value into a register that only gets loaded once in the asm. (It's still exactly equivalent here because you only read the reference once, with no intervening stores.)
A __m128 reference is unusual and would easily be confusing. People maintaining the code later might forget that it's a reference, not a load result, and introduce bugs by reading it again after a store that might overlap. Or at least make the code less efficient if the compiler loads again because it can't prove that a store couldn't have been pointing to the referenced floats.
I tend to write either of these ways:
__m128 v = _mm_load_ps( ptr ); // when doing pointer increments
// or
__m128 v = _mm_load_ps( &arr[i] ); // when using integer indices
// then do stuff to the load result, maybe declaring other __m128 temporaries
Since you often do want to declare other __m128 temporaries when you're doing something non-trivial, it's nice to have the load result be another __m128 like those, so using it multiple times hints the compiler in the direction of loading once into a register. It might still optimize multiple *ptr derefs into one load, but writing the source as close as possible to the efficient asm I want seems like a good idea and sometimes helps. A __m128 &v reference would be even worse, hiding the difference between reading a local var (which is hopefully in a register) vs. re-accessing memory.
For trivial stuff, compilers normally auto-vectorize pretty well so you often don't need intrinsics at all.
It's useless to look at unoptimized asm, especially from intrinsics, since _mm_load_ps is an actual function wrapped around the dereference. It optimizes away if you enable optimization, but if not, there's an extra return-value object that can make the anti-optimized debug-build asm even worse. Intrinsics with optimization disabled are a total disaster for asm efficiency in general.