Search code examples
simdsseintrinsicssse2

can I assign the result of intrinsic that returns __m128i to variable of the type__m128i_u?


as in the title - I want to do as below:

__m128i_u* avxVar = (__m128i_u*)Var;  // Var allocated with alloc
*avxVar = _mm_set_epi64(...);         // is that ok to assign __m128i to __m128i_u ?

Solution

  • Yes, but note that __m128i_u is not portable (e.g. to MSVC); it's what GCC/clang use internally to implement unaligned loadu/storeu intrinsics. It's exactly equivalent to do it the normal way:

    _mm_storeu_si128((__m128i*)Var, vec);
    

    (where vec is any __m128i. e.g. it could be _mm_set_epi64x or a variable.)

    GCC 11's emmintrin.h implementation of _mm_storeu_si128 is defined like this, taking a __m128i_u* pointer arg, so the dereference does an unaligned access (if not optimized away).

    // GCC internals, quoted for reference only.
    // Just use #include <immintrin.h> in your own code
    extern __inline void __attribute__((__gnu_inline__, __always_inline__, __artificial__))
    _mm_storeu_si128 (__m128i_u *__P, __m128i __B)
    {
      *__P = __B;
    }
    

    So yes, GCC's headers depend on __m128i* and __m128i_u* being compatible and implicitly convertible.

    As much as _mm_storeu_si128 is an intrinsic for movdqu, so is a __m128i_u* dereference. But really these intrinsics just exist to communicate alignment information to the compiler, and it's up to the compiler to decide when to actually load and store, just like with deref of char*.

    (Fun fact: __m128i* is a may_alias type, like char*, so you can point it at anything without violating strict-aliasing. Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?)


    Also note that _mm_set_epi64 takes __m64 args: it was for building an SSE2 vector from two MMX vectors, not from scalar int64_t. You probably want _mm_set_epi64x


    They compile identically

    void foo(void *Var) {
        __m128i_u* avxVar = (__m128i_u*)Var;
        *avxVar = _mm_set_epi64x(1, 2); 
    }
    
    void bar(void *Var) {
        _mm_storeu_si128((__m128i*)Var, _mm_set_epi64x(1, 2) );
    }
    

    Both functions compile identically (and are semantically equivalent so will always be the same after inlining) across gcc/clang/MSVC. But only the 2nd one compiles at all with MSVC, as you can see on the Godbolt compiler explorer: https://godbolt.org/z/Y8Wq96Pqs . if you disable the #ifdef __GNUC__, you get compiler errors on MSVC.

    ## GCC -O3
    foo:
            movdqa  xmm0, XMMWORD PTR .LC0[rip]
            movups  XMMWORD PTR [rdi], xmm0
            ret
    bar:
            movdqa  xmm0, XMMWORD PTR .LC0[rip]
            movups  XMMWORD PTR [rdi], xmm0
            ret
    .LC0:
            .quad   2
            .quad   1
    

    With more complex surrounding code, _mm_loadu_si128 can fold into a memory source operand for ALU only with AVX (e.g. vpaddb xmm0, xmm1, [rdi], but _mm_load_si128 aligned loads can fold into SSE memory sources like paddb xmm0, [rdi].