I am trying to use SIMD instructions to speed up the sum of the elements in an array of uint8_t (i.e., a sum reduction). For that purpose, I am replicating the most upvoted answer to this question:
Sum reduction of unsigned bytes without overflow, using SSE2 on Intel
The procedure for the sum reduction shown in that answer is this:
uint16_t sum_32(const uint8_t a[32])
{
    // All-zero vector (note: zero is read before being initialized here;
    // _mm_setzero_si128() would be the cleaner way to get the same thing).
    __m128i zero = _mm_xor_si128(zero, zero);
    // _mm_sad_epu8 against zero sums the bytes of each 16-byte load,
    // leaving a 16-bit partial sum in the low bits of each 64-bit half.
    __m128i sum0 = _mm_sad_epu8(
        zero,
        _mm_load_si128(reinterpret_cast<const __m128i*>(a)));
    __m128i sum1 = _mm_sad_epu8(
        zero,
        _mm_load_si128(reinterpret_cast<const __m128i*>(&a[16])));
    __m128i sum2 = _mm_add_epi16(sum0, sum1);
    // Fold the upper 64-bit half onto the lower half and add.
    __m128i totalsum = _mm_add_epi16(sum2, _mm_shuffle_epi32(sum2, 2));
    return totalsum.m128i_u16[0];  // MSVC-only member access
}
My problem is that the return expression, totalsum.m128i_u16[0], seems to be available only with the Microsoft compiler, but I am working on UNIX-based platforms.
I reviewed the list of SIMD intrinsics, and _mm_storeu_ps(a, t) seems to do something similar to what I need, but t has to be a __m128 and a a float pointer. I tried using that function after casting my result from __m128i to __m128, but it didn't work.
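For reference, this is roughly what I tried (a sketch from memory; _mm_castsi128_ps is my guess at the cast intrinsic involved, and my actual code may have differed):

float tmp[4];
// Reinterpret the integer vector as a float vector and store it to memory.
_mm_storeu_ps(tmp, _mm_castsi128_ps(totalsum));
// tmp[0] now holds the raw low 32 bits reinterpreted as a float,
// not the integer sum I am after.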
Is there another way I can retrieve the first 16 bits of a __m128i variable and store them in a uint16_t variable? I am very new to SIMD programming.
BTW, is there a better solution for implementing the sum reduction? That answer is from 9 years ago, so I imagine there are better alternatives now.
Use _mm_extract_epi16 for a compile-time-known index.

For the first element, _mm_cvtsi128_si32 gives more efficient instructions. That works here because:

- _mm_sad_epu8 leaves bits 16 through 63 of the result set to zero, and
- the uint16_t return type discards the upper bits of the extracted 32-bit value anyway.

Compilers may be able to do this optimization on their own based on either of these facts, but not all of them do, so it is better to use _mm_cvtsi128_si32 explicitly, as in the sketch below.
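A minimal sketch of the whole function written portably, assuming the same SSE2 body as in the question (the name sum_32_portable is just for this illustration; the intrinsics used are available in GCC, Clang, and MSVC):

#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

uint16_t sum_32_portable(const uint8_t a[32])
{
    __m128i zero = _mm_setzero_si128();
    __m128i sum0 = _mm_sad_epu8(zero,
        _mm_load_si128(reinterpret_cast<const __m128i*>(a)));
    __m128i sum1 = _mm_sad_epu8(zero,
        _mm_load_si128(reinterpret_cast<const __m128i*>(&a[16])));
    __m128i sum2 = _mm_add_epi16(sum0, sum1);
    __m128i totalsum = _mm_add_epi16(sum2, _mm_shuffle_epi32(sum2, 2));

    // Portable extraction: the low 32 bits contain the full sum, since
    // _mm_sad_epu8 zeroes bits 16..63 and _mm_add_epi16 keeps each
    // 16-bit lane independent, so those bits stay zero.
    return static_cast<uint16_t>(_mm_cvtsi128_si32(totalsum));

    // Equivalent, typically slightly less efficient:
    // return static_cast<uint16_t>(_mm_extract_epi16(totalsum, 0));
}

Note that a still has to be 16-byte aligned because of _mm_load_si128; use _mm_loadu_si128 if that cannot be guaranteed.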