Better way to store or extract a scalar int result using SSE2 intrinsics

I'm wondering how to load and store variables efficiently when working with SSE2.

In this example, I want to benchmark the pclmulqdq instruction (carry-less multiplication, useful for polynomial arithmetic) against a plain C function, so I need the same "calling convention" as a standard function.

a and b each have 16 significant bits, so the result fits in 32 bits.

#include <wmmintrin.h>

int GFpoly_mul_i(int a, int b) {

 __m128i xa = _mm_loadu_si128( (__m128i*) a);
 __m128i xb = _mm_loadu_si128((__m128i*) b);
 __m128i r = _mm_clmulepi64_si128(xa, xb, 0);

 _MM_ALIGN16 int result[4];
 __m128i* ptr_result = (__m128i*)result;
 _mm_store_si128(ptr_result, r);
 return result[0];
}

Solution

Extracting the 32-bit integer from the lowest part of a vector can be done easily with _mm_cvtsi128_si32:

return _mm_cvtsi128_si32(r);

Loading a 32-bit integer into the lowest part of a vector can be done with the "opposite" operation, _mm_cvtsi32_si128, which also zeroes the upper 96 bits:

__m128i xa = _mm_cvtsi32_si128(a);
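As a quick sanity check of the two conversions, here is a minimal sketch. The helper names roundtrip_low32 and upper_lanes_zero are made up for illustration, and the code assumes an x86 target with SSE2 available:

```c
#include <emmintrin.h>  /* SSE2: _mm_cvtsi32_si128, _mm_cvtsi128_si32 */

/* Hypothetical helper: move an int into lane 0 of a vector and back. */
static int roundtrip_low32(int x) {
    __m128i v = _mm_cvtsi32_si128(x);  /* lane 0 = x, upper 96 bits zeroed */
    return _mm_cvtsi128_si32(v);       /* read lane 0 back */
}

/* Hypothetical helper: returns 1 when the upper 96 bits are all zero. */
static int upper_lanes_zero(int x) {
    __m128i v  = _mm_cvtsi32_si128(x);
    __m128i hi = _mm_srli_si128(v, 4);  /* byte shift right by 4: drops lane 0 */
    __m128i eq = _mm_cmpeq_epi8(hi, _mm_setzero_si128());
    return _mm_movemask_epi8(eq) == 0xFFFF;  /* all 16 bytes compared equal */
}
```

The zeroed upper bits matter here because pclmulqdq reads a full 64-bit half of each operand, not just the low 32 bits.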

Loading the integer a into a vector cannot be done with _mm_loadu_si128((__m128i*) a). That expression casts a to a pointer and dereferences it, reading a 128-bit vector from that address, but a is just an integer value and doesn't point anywhere useful, except perhaps by accident.
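Putting the pieces together, the function from the question could be sketched as follows. Two things here are my own additions rather than part of the question: the TARGET_PCLMUL macro, which is only a convenience so that GCC or Clang will build this one function without a global -mpclmul flag, and GFpoly_mul_ref, a hypothetical shift-and-XOR reference for checking results. Running it still requires a CPU that supports PCLMULQDQ.

```c
#include <emmintrin.h>  /* SSE2: _mm_cvtsi32_si128, _mm_cvtsi128_si32 */
#include <wmmintrin.h>  /* PCLMULQDQ: _mm_clmulepi64_si128 */

/* Assumption: GCC/Clang function-level target attribute; empty elsewhere. */
#if defined(__GNUC__)
#define TARGET_PCLMUL __attribute__((target("sse2,pclmul")))
#else
#define TARGET_PCLMUL
#endif

/* Carry-less multiply of a and b (16 significant bits each). */
TARGET_PCLMUL
int GFpoly_mul_i(int a, int b) {
    __m128i xa = _mm_cvtsi32_si128(a);             /* a in lane 0, rest zero */
    __m128i xb = _mm_cvtsi32_si128(b);             /* b in lane 0, rest zero */
    __m128i r  = _mm_clmulepi64_si128(xa, xb, 0);  /* low qword x low qword */
    return _mm_cvtsi128_si32(r);                   /* low 32 bits of product */
}

/* Hypothetical plain C reference: XOR shifted copies of a for each set bit of b. */
int GFpoly_mul_ref(int a, int b) {
    unsigned int ua = (unsigned int)a, ub = (unsigned int)b, acc = 0;
    while (ub != 0) {
        if (ub & 1u) {
            acc ^= ua;
        }
        ua <<= 1;
        ub >>= 1;
    }
    return (int)acc;
}
```

For example, GFpoly_mul_i(3, 3) gives 5, since (x + 1)^2 = x^2 + 1 over GF(2). No temporary array or aligned store is needed, because both the input and the output move directly between a general-purpose register and the vector register.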