Tags: c++, simd, sse, intrinsics

Safe and efficient way to use SIMD intrinsics on an existing float array


I am learning about SSE and AVX to further improve the performance of some of the computations in my code.

However, I have come across multiple different ways to use SSE instructions on an existing array of floats, and I would like to know which of them are safe (no UB) and efficient. I have marked the line of code that differs across versions with an arrow comment (<--) and put all three versions on Godbolt (Compiler Explorer) to compare the generated assembly.

Version 1 :- Using _mm_load_ps

#include <immintrin.h>
#include <iostream>

int main()
{
    __m128 simd = _mm_set1_ps(10.0f) ;
    alignas(16) float float_arr[4] = {0, 1, 2, 3} ;
    __m128 load_simd = _mm_load_ps(float_arr) ; // <-------
    __m128 sum = _mm_add_ps(simd, load_simd) ;
    alignas(16) float float_arr_sum[4] ;
    _mm_store_ps(float_arr_sum, sum) ;
    std::cout << float_arr_sum[0] << ", " << float_arr_sum[1] << ", " << float_arr_sum[2] << ", " << float_arr_sum[3] << std::endl ;
}

Version 2 :- Using __m128& (reference)

#include <immintrin.h>
#include <iostream>

int main()
{
    __m128 simd = _mm_set1_ps(10.0f) ;
    alignas(16) float float_arr[4] = {0, 1, 2, 3} ;
    __m128& cast_ref_simd = reinterpret_cast<__m128&>(float_arr[0]) ; // <-----
    __m128 sum = _mm_add_ps(simd, cast_ref_simd) ;
    alignas(16) float float_arr_sum[4] ;
    _mm_store_ps(float_arr_sum, sum) ;
    std::cout << float_arr_sum[0] << ", " << float_arr_sum[1] << ", " << float_arr_sum[2] << ", " << float_arr_sum[3] << std::endl ;
}

Version 3 :- Using __m128* (pointer)

#include <immintrin.h>
#include <iostream>

int main()
{
    __m128 simd = _mm_set1_ps(10.0f) ;
    alignas(16) float float_arr[4] = {0, 1, 2, 3} ;
    __m128* cast_ptr_simd = reinterpret_cast<__m128*>(float_arr) ; // <-------
    __m128 sum = _mm_add_ps(simd, *cast_ptr_simd) ; // <-------
    alignas(16) float float_arr_sum[4] ;
    _mm_store_ps(float_arr_sum, sum) ;
    std::cout << float_arr_sum[0] << ", " << float_arr_sum[1] << ", " << float_arr_sum[2] << ", " << float_arr_sum[3] << std::endl ;
}

I tested all of these in Compiler Explorer. Without optimization, versions 2 and 3 each generate an extra movaps (move aligned packed single-precision) between the xmm0 and xmm1 registers, and their assembly output is identical to each other. At -O1 and above (-O1, -O2, -O3, ...), all three versions generate the same code.

For my project I am using -O3, so I expect the compiled output will not change, but I would still like to properly understand what is going on.

I am also looking at some popular libraries such as Agner Fog's header-only vector class library and Vc (which is expected to become std::simd at some point), but I want to get some low-level experience before using them.


Solution

  • All 3 versions are exactly equivalent as far as the compiler is concerned in this use-case, where you only read the reference / dereference the pointer once, with no stores between the cast (or load) and that single read.

    See Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior? (no UB, it's safe.)


    So the choice comes down to style and readability. _mm_load_ps, or _mm_loadu_ps for unaligned, are standard. The compiler can still fold a _mm_load_ps into a memory source operand for addps xmm0, [rdi] or whatever.
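
    For example, a small helper like the one below (name hypothetical; register names assume the x86-64 System V calling convention) usually compiles to a single addps xmm0, [rdi] at -O1 and above, because the compiler folds the aligned load into the instruction's memory operand:

        #include <immintrin.h>

        // Add 4 floats from a 16-byte-aligned pointer to an accumulator.
        // With optimization enabled this typically becomes: addps xmm0, [rdi]
        __m128 add_from_mem(__m128 acc, const float* aligned_src)
        {
            return _mm_add_ps(acc, _mm_load_ps(aligned_src));
        }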

    If you cast to raw __m128 *p and deref with *p, that only works for aligned (equivalent to _mm_load_ps not loadu). If you wanted to modify your code to allow an unaligned input (e.g. a pointer to the middle of an array, which is something you might want even if all your arrays are aligned), you'd have to change your code significantly to use _mm_loadu_ps. That intrinsic wants a float*, so you'd actually have to cast _mm_loadu_ps( (float*)p ).
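
    A minimal sketch of the unaligned case, assuming ptr may point anywhere into a float array (function name hypothetical):

        #include <immintrin.h>

        // _mm_loadu_ps has no alignment requirement, unlike _mm_load_ps
        // or dereferencing a raw __m128*.
        __m128 add_unaligned(__m128 acc, const float* ptr)
        {
            __m128 v = _mm_loadu_ps(ptr);   // safe for any alignment
            return _mm_add_ps(acc, v);
        }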

    Some people like to increment a __m128i *ptr through their array for integer code, where the load[u] / store[u] intrinsics take __m128i* rather than void*. But most code still uses the load / store intrinsics instead of a raw deref, mostly just to make the memory access more visible.
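
    Integer code written in that pointer-increment style might look like this sketch (hypothetical function; n is assumed to be a multiple of 4):

        #include <immintrin.h>
        #include <cstddef>

        // Add 1 to every 32-bit int in the array.
        void increment_ints(int* arr, std::size_t n)
        {
            const __m128i ones = _mm_set1_epi32(1);
            __m128i* p = reinterpret_cast<__m128i*>(arr);
            for (std::size_t i = 0; i < n / 4; ++i, ++p) {
                __m128i v = _mm_loadu_si128(p);              // load[u] takes __m128i*
                _mm_storeu_si128(p, _mm_add_epi32(v, ones)); // store[u] too
            }
        }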

    Using a reference instead of a load seems like a terrible habit. Usually you want the compiler to load once, not keep re-referencing memory every time you read a variable, even after storing to other locations. Using a reference would force it to do alias analysis (figure out two arrays can't overlap, or that a specific store can't overlap with any __m128& references) if it wants to optimize the referenced __m128 value into a register that only gets loaded once in the asm. (It's still exactly equivalent here because you only read the reference once, with no intervening stores.)

    A __m128 reference is unusual and would easily be confusing. People maintaining the code later might forget that it's a reference not a load result, and introduce bugs by reading it again after a store that might overlap. Or at least make code less efficient if the compiler loads again because it can't prove that a store couldn't have been pointing to the referenced floats.
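
    A contrived sketch of that hazard (hypothetical functions; both pointers assumed 16-byte aligned): with the reference, the compiler may have to reload the memory behind ref after the store to dst unless it can prove the two regions don't overlap, whereas v in the load version is a plain local value that no store can change behind its back.

        #include <immintrin.h>

        __m128 reference_version(float* dst, const float* src)
        {
            const __m128& ref = reinterpret_cast<const __m128&>(*src);
            __m128 a = _mm_add_ps(ref, ref);
            _mm_store_ps(dst, a);          // might alias the floats behind ref
            return _mm_add_ps(a, ref);     // compiler may need to reload here
        }

        __m128 load_version(float* dst, const float* src)
        {
            __m128 v = _mm_load_ps(src);   // loaded once into a local
            __m128 a = _mm_add_ps(v, v);
            _mm_store_ps(dst, a);
            return _mm_add_ps(a, v);       // v still holds the original value
        }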

    I tend to write it in either of these ways:

        __m128 v = _mm_load_ps( ptr );    // when doing pointer increments
    // or
        __m128 v = _mm_load_ps( &arr[i] );  // when using integer indices
    
    // then do stuff to the load result, maybe declaring other __m128 temporaries
    

    Since you do often want to declare other __m128 temporaries when you're doing something non-trivial, it's nice to have the load result be another __m128 like those, so using it multiple times hints the compiler in the direction of loading once into a register. It might still optimize multiple *ptr derefs into one load, but writing the source as close as possible to the efficient asm that I want seems like a good idea and sometimes helps. A __m128 &v reference would be even worse, hiding the difference between reading a local var which is hopefully in a register vs. re-accessing memory.
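
    For instance, in a slightly less trivial helper (hypothetical name), the load result sits next to the other __m128 temporaries and reads like just another register value:

        #include <immintrin.h>
        #include <cstddef>

        __m128 scale_and_offset(const float* arr, std::size_t i, __m128 scale, __m128 offset)
        {
            __m128 v   = _mm_load_ps(&arr[i]);   // load once into a local
            __m128 t   = _mm_mul_ps(v, scale);   // other __m128 temporaries...
            __m128 out = _mm_add_ps(t, offset);  // ...all used the same way
            return out;
        }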

    For trivial stuff, compilers normally auto-vectorize pretty well so you often don't need intrinsics at all.
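
    For example, a plain scalar loop like this sketch typically gets auto-vectorized into packed addps by GCC and Clang at -O2 or -O3 (depending on the compiler's defaults), with no intrinsics in the source:

        #include <cstddef>

        void add_ten(float* arr, std::size_t n)
        {
            for (std::size_t i = 0; i < n; ++i)
                arr[i] += 10.0f;    // the compiler can vectorize this on its own
        }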


    It's useless to look at unoptimized asm, especially from intrinsics, since _mm_load_ps is an actual function wrapped around the dereference. It will optimize away if you enable optimization, but if not there's an extra return-value object that might make the anti-optimized debug-build asm even worse. Intrinsics with optimization disabled is a total disaster for asm efficiency in general.