I am trying to use SIMD instructions to speed up the sum of the elements in an array of uint8_t (i.e., a sum reduction). For that purpose, I am replicating the most upvoted answer to this question:
Sum reduction of unsigned bytes without overflow, using SSE2 on Intel
The procedure for the sum reduction shown in that answer is this:
uint16_t sum_32(const uint8_t a[32])
{
    // All-zero vector (note: zero is read before being initialized here;
    // _mm_setzero_si128() would be the cleaner way to get the same thing).
    __m128i zero = _mm_xor_si128(zero, zero);
    // _mm_sad_epu8 against zero sums the bytes of each 16-byte load,
    // leaving a 16-bit partial sum in the low bits of each 64-bit half.
    __m128i sum0 = _mm_sad_epu8(
        zero,
        _mm_load_si128(reinterpret_cast<const __m128i*>(a)));
    __m128i sum1 = _mm_sad_epu8(
        zero,
        _mm_load_si128(reinterpret_cast<const __m128i*>(&a[16])));
    __m128i sum2 = _mm_add_epi16(sum0, sum1);
    // Fold the upper 64-bit half onto the lower half and add.
    __m128i totalsum = _mm_add_epi16(sum2, _mm_shuffle_epi32(sum2, 2));
    return totalsum.m128i_u16[0];  // MSVC-only member access
}
My problem is that the return expression, totalsum.m128i_u16[0], seems to be available only with the Microsoft compiler, but I am working on UNIX-based platforms.
I reviewed the list of SIMD intrinsics, and _mm_storeu_ps(a, t) seems to do something similar to what I need, but t has to be a __m128 and a a float pointer. I tried using that function after casting my result from __m128i to __m128, but it didn't work.
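For reference, this is roughly what I tried (a sketch from memory; _mm_castsi128_ps is my guess at the cast intrinsic involved, and my actual code may have differed):

float tmp[4];
// Reinterpret the integer vector as a float vector and store it to memory.
_mm_storeu_ps(tmp, _mm_castsi128_ps(totalsum));
// tmp[0] now holds the raw low 32 bits reinterpreted as a float,
// not the integer sum I am after.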
Is there another way I can retrieve the first 16 bits of a __m128i variable and store them in a uint16_t variable? I am very new to SIMD programming.
BTW, is there a better solution for implementing the sum reduction? That answer is from 9 years ago, so I imagine there are better alternatives now.
Use _mm_extract_epi16 for a compile-time-known index.

For the first element, _mm_cvtsi128_si32 gives more efficient instructions. That works here because:

- _mm_sad_epu8 leaves bits 16 through 63 of the result set to zero, and
- the uint16_t return type discards the upper bits of the extracted 32-bit value anyway.

Compilers may be able to do this optimization on their own based on either of these facts, but not all of them do, so it is better to use _mm_cvtsi128_si32 explicitly, as in the sketch below.
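A minimal sketch of the whole function written portably, assuming the same SSE2 body as in the question (the name sum_32_portable is just for this illustration; the intrinsics used are available in GCC, Clang, and MSVC):

#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

uint16_t sum_32_portable(const uint8_t a[32])
{
    __m128i zero = _mm_setzero_si128();
    __m128i sum0 = _mm_sad_epu8(zero,
        _mm_load_si128(reinterpret_cast<const __m128i*>(a)));
    __m128i sum1 = _mm_sad_epu8(zero,
        _mm_load_si128(reinterpret_cast<const __m128i*>(&a[16])));
    __m128i sum2 = _mm_add_epi16(sum0, sum1);
    __m128i totalsum = _mm_add_epi16(sum2, _mm_shuffle_epi32(sum2, 2));

    // Portable extraction: the low 32 bits contain the full sum, since
    // _mm_sad_epu8 zeroes bits 16..63 and _mm_add_epi16 keeps each
    // 16-bit lane independent, so those bits stay zero.
    return static_cast<uint16_t>(_mm_cvtsi128_si32(totalsum));

    // Equivalent, typically slightly less efficient:
    // return static_cast<uint16_t>(_mm_extract_epi16(totalsum, 0));
}

Note that a still has to be 16-byte aligned because of _mm_load_si128; use _mm_loadu_si128 if that cannot be guaranteed.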