
How to set an int32 value at some index within an __m128i with only SSE2?


Is there an SSE2 intrinsic that can set a single int32 value within an __m128i?

For example, setting the value 1000 at index 1 of an __m128i that already contains 1,2,3,4 (which results in 1,1000,3,4)?


Solution

  • If SSE4.1 is available, use

    __m128i _mm_insert_epi32 (__m128i a, int i, const int imm8)

    If you are limited to SSE2, you need to split it into two calls of

    __m128i _mm_insert_epi16 (__m128i a, int i, const int imm8)

    return _mm_insert_epi16(_mm_insert_epi16(a, 1000, 2), 0, 3);
    

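    to set 1000 in lane 1 of a, interpreted as a vector of four 32-bit ints. The first call writes the low 16 bits of that lane (16-bit element 2) and the second writes the high 16 bits (element 3), which are zero for 1000.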
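
    The two-insert line above works for 1000 because its upper 16 bits are zero. As a rough sketch of how the same idea extends to an arbitrary 32-bit value, the helper below splits the value into halves and inserts each one. The names insert_epi32_sse2 and get_lane are hypothetical and not part of the original answer. The lane is a template parameter because _mm_insert_epi16 needs a compile-time immediate.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper (not from the original answer): writes a full 32-bit
// value into 32-bit lane `Lane` of `a` using two 16-bit inserts.
template <int Lane>
__m128i insert_epi32_sse2(__m128i a, int32_t value) {
    static_assert(Lane >= 0 && Lane < 4, "lane must be 0..3");
    const uint32_t bits = static_cast<uint32_t>(value);
    // 32-bit lane N occupies 16-bit elements 2N (low half) and 2N+1 (high half).
    a = _mm_insert_epi16(a, static_cast<int>(bits & 0xFFFFu), 2 * Lane);
    a = _mm_insert_epi16(a, static_cast<int>(bits >> 16), 2 * Lane + 1);
    return a;
}

// Hypothetical helper: reads one 32-bit lane back out, for inspection only.
inline int32_t get_lane(__m128i v, int i) {
    assert(i >= 0 && i < 4);
    alignas(16) int32_t tmp[4];
    _mm_store_si128(reinterpret_cast<__m128i*>(tmp), v);
    return tmp[i];
}
```

    With a = _mm_setr_epi32(1, 2, 3, 4), calling insert_epi32_sse2<1>(a, 1000) should reproduce the 1,1000,3,4 result from the question, and negative values keep their sign because both halves are written.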

    With SSSE3 available (required for _mm_alignr_epi8), I would presume that a shifting/shuffling sequence would be more efficient:

    a = _mm_shuffle_epi32(a, 0xe0);      // lanes, high to low: 3 2 0 0 (lane 1 dropped)
    __m128i b = _mm_cvtsi32_si128(value);
    b = _mm_alignr_epi8(b, a, 4);        // value 3 2 0
    return  _mm_shuffle_epi32(b, 0x9c);  // 3 2 value 0 (0x5c would duplicate lane 2 into lane 3)
    

    If the value is in a 64-bit register, one can use _mm_cvtsi64_si128 instead.

    GCC is able to convert the store/modify/load sequence into pinsrd xmm0, eax, 1 when SSE4.1 is enabled, but produces quite a convoluted sequence without -msse4:

        movd    eax, xmm0
        movaps  XMMWORD PTR [rsp-24], xmm0
        movabs  rdx, 4294967296000
        or      rax, rdx
        mov     QWORD PTR [rsp-24], rax
        movdqa  xmm0, XMMWORD PTR [rsp-24]
        ret 
    

    Clang, on the other hand, keeps the store, modify on the stack, reload pattern as written:

        movaps  xmmword ptr [rsp - 24], xmm0
        mov     dword ptr [rsp - 20], 1000
        movaps  xmm0, xmmword ptr [rsp - 24]
        ret
    

    The overall winner is probably the store/modify/load combination, which also accepts a runtime index. All the other approaches above require hard-coded immediates, including those that use the insert intrinsics.

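    // Needs <emmintrin.h>, <cstdint> and <cstddef>; index must be in 0..3.
    __m128i store_modify_load(__m128i a, int value, size_t index) {
       alignas(16) int32_t tmp[4] = {};
       _mm_store_si128(reinterpret_cast<__m128i*>(tmp), a);   // spill the vector to the stack
       tmp[index] = value;                                    // overwrite one 32-bit lane
       return  _mm_load_si128(reinterpret_cast<const __m128i*>(tmp));
    }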
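
    For comparison, a runtime index is also possible in pure SSE2 without going through memory, by building a lane mask and blending with AND/ANDNOT/OR. This is a different technique from the ones in the answer, shown only as a minimal sketch. The name set_lane_masked is hypothetical, and no claim is made here about how it performs relative to store/modify/load.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper (not from the original answer): replaces 32-bit lane
// `index` (0..3, chosen at run time) with `value` using only SSE2 logic ops.
inline __m128i set_lane_masked(__m128i a, int32_t value, int index) {
    assert(index >= 0 && index < 4);
    const __m128i lane_ids = _mm_setr_epi32(0, 1, 2, 3);
    // All-ones in the selected lane, all-zeros in the other three.
    const __m128i mask = _mm_cmpeq_epi32(lane_ids, _mm_set1_epi32(index));
    // _mm_andnot_si128(m, x) computes (~m) & x, so this clears the selected lane of a.
    const __m128i kept = _mm_andnot_si128(mask, a);
    // Broadcast value, then keep it only in the selected lane.
    const __m128i inserted = _mm_and_si128(mask, _mm_set1_epi32(value));
    return _mm_or_si128(kept, inserted);
}
```

    Calling set_lane_masked(_mm_setr_epi32(1, 2, 3, 4), 1000, 1) should give 1,1000,3,4, the same result as store_modify_load with index 1.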
    

    See the produced assembly on Godbolt.