c, sse, intrinsics, sse2

Better way to store or extract scalar int result using SSE2 intrinsic


I'm wondering how to load and store variables efficiently when working with SSE2.

In this example, I want to benchmark the pclmulqdq instruction (carry-less multiplication, useful for polynomial arithmetic) against a plain C function, so I need the same "calling convention" as a standard function.

a and b each have 16 significant bits, so the result fits in 32 bits (a carry-less product of two 16-bit values has at most 31 significant bits).

#include <wmmintrin.h>

int GFpoly_mul_i(int a, int b) {

 __m128i xa = _mm_loadu_si128( (__m128i*) a);
 __m128i xb = _mm_loadu_si128((__m128i*) b);
 __m128i r = _mm_clmulepi64_si128(xa, xb, 0);

 _MM_ALIGN16 int result[4];
 __m128i* ptr_result = (__m128i*)result;
 _mm_store_si128(ptr_result, r);
 return result[0];
}

Solution

  • Extracting a 32-bit integer from the lowest element of a vector can be done with _mm_cvtsi128_si32:

    return _mm_cvtsi128_si32(r);
    

    Loading a 32-bit integer into the lowest element of a vector (with the remaining elements zeroed) can be done with the "opposite" operation, _mm_cvtsi32_si128:

    __m128i xa = _mm_cvtsi32_si128(a);
    

    Loading the integer a into a vector cannot be done with _mm_loadu_si128((__m128i*) a). That expression casts a to a pointer and dereferences it, reading a 128-bit vector from that address, but a is just an integer value and doesn't point anywhere useful, except perhaps by accident.
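Putting the two conversions together, the function from the question could look roughly like the sketch below. It assumes GCC or Clang on x86 (the PCLMUL_TARGET macro is a hypothetical helper that enables PCLMUL for one function, so no -mpclmul flag is needed) and a CPU that supports pclmulqdq. GFpoly_mul_ref is not from the question. It is a hypothetical plain C reference for 16-bit inputs, included only to check the result and to serve as the baseline in a benchmark.

```c
#include <emmintrin.h>  /* SSE2: _mm_cvtsi32_si128, _mm_cvtsi128_si32 */
#include <wmmintrin.h>  /* PCLMULQDQ: _mm_clmulepi64_si128 */

/* Assumption: GCC/Clang. Other compilers need their own way to enable PCLMUL. */
#if defined(__GNUC__)
#define PCLMUL_TARGET __attribute__((target("pclmul")))
#else
#define PCLMUL_TARGET
#endif

/* Carry-less multiply of a and b using pclmulqdq. */
PCLMUL_TARGET
int GFpoly_mul_i(int a, int b) {
    /* Move each int into the low 32 bits of a vector; the upper bits are zeroed. */
    __m128i xa = _mm_cvtsi32_si128(a);
    __m128i xb = _mm_cvtsi32_si128(b);

    /* Selector 0: multiply the low 64-bit halves of xa and xb. */
    __m128i r = _mm_clmulepi64_si128(xa, xb, 0);

    /* Return the low 32 bits of the product; no store to memory is needed. */
    return _mm_cvtsi128_si32(r);
}

/* Hypothetical plain C reference: shift-and-XOR over the 16 bits of b. */
int GFpoly_mul_ref(int a, int b) {
    unsigned int ua = (unsigned int)a & 0xFFFFu;
    unsigned int ub = (unsigned int)b & 0xFFFFu;
    unsigned int r = 0;
    for (int i = 0; i < 16; i++) {
        if ((ub >> i) & 1u) {
            r ^= ua << i;  /* XOR instead of add: no carries between bits */
        }
    }
    return (int)r;
}
```

For example, GFpoly_mul_i(3, 3) gives 5, because (x + 1)^2 = x^2 + 1 over GF(2), and both functions should agree for any pair of 16-bit inputs.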