I'm wondering how to load and store variables efficiently when working with SSE2.
In this example, I want to benchmark the pclmulqdq instruction (carry-less multiplication, useful for polynomial arithmetic) against a plain C function, so I need the same "calling convention" as a standard function.
a and b each have 16 significant bits, so the result fits in 32 bits.
#include <wmmintrin.h>
int GFpoly_mul_i(int a, int b) {
__m128i xa = _mm_loadu_si128( (__m128i*) a);
__m128i xb = _mm_loadu_si128((__m128i*) b);
__m128i r = _mm_clmulepi64_si128(xa, xb, 0);
_MM_ALIGN16 int result[4];
__m128i* ptr_result = (__m128i*)result;
_mm_store_si128(ptr_result, r);
return result[0];
}
Extracting the 32-bit integer from the lowest part of a vector can be done easily with _mm_cvtsi128_si32:
return _mm_cvtsi128_si32(r);
Loading a 32-bit integer into the lowest part of a vector can be done with the "opposite" operation, _mm_cvtsi32_si128:
__m128i xa = _mm_cvtsi32_si128(a);
Loading the integer a into a vector cannot be done with _mm_loadu_si128((__m128i*) a): this would cast a to a pointer and dereference it (reading a 128-bit vector), but a is just an integer value and doesn't point anywhere useful, except perhaps by accident.
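Putting the two conversions together, the function from the question could look roughly like the sketch below. It keeps the question's name and signature (GFpoly_mul_i). Everything else is an assumption added for illustration: the bitwise reference function gfpoly_mul_ref, the HAVE_CLMUL_SKETCH macro, and the GCC/Clang target attribute, which is used only so the file compiles without passing -mpclmul. The CPU still has to support PCLMULQDQ at run time.

```c
#include <stdint.h>

/* Hypothetical portable reference: carry-less (GF(2)[x]) product of two
 * 16-bit polynomials, computed bit by bit. Handy as the "plain C" side of
 * the benchmark and for checking the intrinsic version. */
uint32_t gfpoly_mul_ref(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    for (int i = 0; i < 16; ++i) {
        if ((b >> i) & 1u)
            r ^= a << i;   /* XOR instead of ADD: no carries propagate */
    }
    return r;
}

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#include <wmmintrin.h>
#define HAVE_CLMUL_SKETCH 1   /* illustrative macro, not part of any library */

/* Assumption: GCC or Clang. The target attribute enables SSE2 and PCLMUL
 * for this one function only. */
__attribute__((target("sse2,pclmul")))
int GFpoly_mul_i(int a, int b)
{
    /* Move each integer into the low 32 bits of a vector (upper bits zeroed). */
    __m128i xa = _mm_cvtsi32_si128(a);
    __m128i xb = _mm_cvtsi32_si128(b);

    /* Selector 0x00: multiply the low 64-bit halves of xa and xb. */
    __m128i r = _mm_clmulepi64_si128(xa, xb, 0x00);

    /* Return the low 32 bits; no round trip through memory is needed. */
    return _mm_cvtsi128_si32(r);
}
#endif
```

For 16-bit inputs the two functions should agree, so comparing GFpoly_mul_i against gfpoly_mul_ref on a handful of values is a cheap sanity check before benchmarking.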