Is there an SSE2 intrinsic that can set a single int32 value within an __m128i?
For example, setting the value 1000 at index 1 of an __m128i that already contains 1,2,3,4 (which results in 1,1000,3,4)?
If SSE4.1 is available, use
__m128i _mm_insert_epi32 (__m128i a, int i, const int imm8)
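As a minimal sketch of the SSE4.1 route applied to the example from the question: the names set_lane1_sse41 and lanes are made up for illustration, and the target attribute is a GCC/Clang extension that I am assuming here so the snippet builds without -msse4.1 on the command line.

```cpp
#include <smmintrin.h>  // SSE4.1: _mm_insert_epi32
#include <array>
#include <cstdint>

// Hypothetical helper: replace 32-bit lane 1 of `a` with `value`.
// Assumption: GCC or Clang. The attribute enables SSE4.1 for this function
// only; compiling the whole file with -msse4.1 makes it unnecessary.
__attribute__((target("sse4.1")))
__m128i set_lane1_sse41(__m128i a, int value) {
    // The lane index (last argument) must be a compile-time constant.
    return _mm_insert_epi32(a, value, 1);
}

// Hypothetical helper: copy the four 32-bit lanes out for inspection.
std::array<int32_t, 4> lanes(__m128i v) {
    std::array<int32_t, 4> out;
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out.data()), v);
    return out;
}
```

Called as set_lane1_sse41(_mm_setr_epi32(1, 2, 3, 4), 1000), this should give the lanes 1, 1000, 3, 4.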
If you are limited to SSE2, you need to split it into two calls of
__m128i _mm_insert_epi16 (__m128i a, int i, const int imm8)
return _mm_insert_epi16(_mm_insert_epi16(a, 1000, 2), 0, 3);
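to set 1000 in lane 1 of a, interpreted as a vector of 32-bit ints. The 16-bit lanes 2 and 3 hold the low and high halves of that 32-bit lane, and the high half of 1000 is 0.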
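The two-call idea can be generalized to an arbitrary value and lane by splitting the value into its 16-bit halves. This is only a sketch: insert_epi32_sse2 and lanes are hypothetical names, and the lane is a template parameter because _mm_insert_epi16 needs a compile-time constant index.

```cpp
#include <emmintrin.h>  // SSE2: _mm_insert_epi16
#include <array>
#include <cstdint>

// Hypothetical helper: write `value` into 32-bit lane `Lane` (0..3), SSE2 only.
template <int Lane>
__m128i insert_epi32_sse2(__m128i a, int32_t value) {
    static_assert(Lane >= 0 && Lane < 4, "lane index out of range");
    const uint32_t bits = static_cast<uint32_t>(value);
    // 32-bit lane N occupies 16-bit lanes 2N (low half) and 2N+1 (high half).
    a = _mm_insert_epi16(a, static_cast<int>(bits & 0xffffu), 2 * Lane);
    a = _mm_insert_epi16(a, static_cast<int>(bits >> 16), 2 * Lane + 1);
    return a;
}

// Hypothetical helper: copy the four 32-bit lanes out for inspection.
std::array<int32_t, 4> lanes(__m128i v) {
    std::array<int32_t, 4> out;
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out.data()), v);
    return out;
}
```

For instance, insert_epi32_sse2<1>(_mm_setr_epi32(1, 2, 3, 4), 1000) should yield 1, 1000, 3, 4. Negative values work too, since only the bit pattern is split.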
With SSSE3 available (needed for _mm_alignr_epi8), I would presume that a shifting/shuffling sequence would be more efficient:
a = _mm_shuffle_epi32(a, 0xe0); // lanes, high to low: 3 2 0 0
__m128i b = _mm_cvtsi32_si128(value); // 0 0 0 value
b = _mm_alignr_epi8(b, a, 4); // value 3 2 0
return _mm_shuffle_epi32(b, 0x9c); // 3 2 value 0 (0x5c would duplicate lane 2 into lane 3)
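Wrapped up as a complete function, the shuffle sequence might look like the sketch below. The names set_lane1_ssse3 and lanes are hypothetical, and the target attribute assumes GCC or Clang. The final shuffle uses the immediate 0x9c, which selects source lanes 2, 1, 3, 0 (high to low), so that the original lane 3 survives in the top position.

```cpp
#include <tmmintrin.h>  // SSSE3: _mm_alignr_epi8
#include <array>
#include <cstdint>

// Hypothetical helper: replace 32-bit lane 1 of `a` with `value`.
// Assumption: GCC or Clang. The attribute enables SSSE3 for this function
// only; compiling with -mssse3 makes it unnecessary.
__attribute__((target("ssse3")))
__m128i set_lane1_ssse3(__m128i a, int value) {
    a = _mm_shuffle_epi32(a, 0xe0);        // lanes, high to low: a3 a2 a0 a0
    __m128i b = _mm_cvtsi32_si128(value);  // 0 0 0 value
    b = _mm_alignr_epi8(b, a, 4);          // value a3 a2 a0
    return _mm_shuffle_epi32(b, 0x9c);     // a3 a2 value a0
}

// Hypothetical helper: copy the four 32-bit lanes out for inspection.
std::array<int32_t, 4> lanes(__m128i v) {
    std::array<int32_t, 4> out;
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out.data()), v);
    return out;
}
```

Using four distinct input lanes such as 1, 2, 3, 4 is a cheap way to confirm that only lane 1 changes.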
If the value is in a 64-bit register, one can use _mm_cvtsi64_si128 instead.
GCC is able to convert the store/modify/load function below into pinsrd xmm0, eax, 1 when SSE4 is enabled, but produces quite a convoluted sequence without -msse4:
movd eax, xmm0
movaps XMMWORD PTR [rsp-24], xmm0
movabs rdx, 4294967296000
or rax, rdx
mov QWORD PTR [rsp-24], rax
movdqa xmm0, XMMWORD PTR [rsp-24]
ret
Clang, on the other hand, keeps the store, modify-on-stack, load sequence as written:
movaps xmmword ptr [rsp - 24], xmm0
mov dword ptr [rsp - 20], 1000
movaps xmm0, xmmword ptr [rsp - 24]
ret
Probably the overall winner is the store/modify/load combo, which also accepts an index chosen at run time. All the others require hard-coded immediates, including those using the insert intrinsics.
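#include <emmintrin.h> // SSE2
#include <cstddef>
#include <cstdint>

__m128i store_modify_load(__m128i a, int value, size_t index) {
    alignas(16) int32_t tmp[4] = {};
    _mm_store_si128(reinterpret_cast<__m128i*>(tmp), a); // spill the vector to the stack
    tmp[index] = value; // index must be in 0..3
    return _mm_load_si128(reinterpret_cast<const __m128i*>(tmp)); // reload the modified vector
}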
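To illustrate the runtime-index property, here is a small self-check sketch. The function body is repeated so that the snippet compiles on its own; lanes and replaces_only_requested_lane are hypothetical helper names, not part of any library.

```cpp
#include <emmintrin.h>  // SSE2
#include <array>
#include <cstddef>
#include <cstdint>

__m128i store_modify_load(__m128i a, int value, std::size_t index) {
    alignas(16) int32_t tmp[4] = {};
    _mm_store_si128(reinterpret_cast<__m128i*>(tmp), a);
    tmp[index] = value;  // index must be in 0..3
    return _mm_load_si128(reinterpret_cast<const __m128i*>(tmp));
}

// Hypothetical helper: copy the four 32-bit lanes out for inspection.
std::array<int32_t, 4> lanes(__m128i v) {
    std::array<int32_t, 4> out;
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out.data()), v);
    return out;
}

// Hypothetical check: the index is an ordinary loop variable, not an immediate.
bool replaces_only_requested_lane() {
    const std::array<int32_t, 4> in{1, 2, 3, 4};
    for (std::size_t i = 0; i < 4; ++i) {
        const __m128i v = _mm_setr_epi32(in[0], in[1], in[2], in[3]);
        const std::array<int32_t, 4> out = lanes(store_modify_load(v, 1000, i));
        for (std::size_t j = 0; j < 4; ++j) {
            const int32_t expected = (j == i) ? 1000 : in[j];
            if (out[j] != expected) return false;
        }
    }
    return true;
}
```

The insert-intrinsic versions cannot be driven by a loop variable like this without a switch or template dispatch over the four possible lanes.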
See the produced assembly in godbolt.