I multiply and round four 32-bit floats, then convert them to four 16-bit integers with SSE intrinsics. I'd like to store the four integer results to an array. With floats it's easy: _mm_store_ps(float_ptr, m128value). However, I haven't found any instruction to do this with 16-bit (__m64) integers.
void process(float *fptr, int16_t *sptr, __m128 factor)
{
    __m128 a = _mm_load_ps(fptr);
    __m128 b = _mm_mul_ps(a, factor);
    __m128 c = _mm_round_ps(b, _MM_FROUND_TO_NEAREST_INT);
    __m64 s = _mm_cvtps_pi16(c);
    // now store the values to sptr
}
Any help would be appreciated.
Personally, I would avoid using MMX. Also, I would use an explicit store rather than an implicit one, which often only works on certain compilers. The following code works fine in MSVC2012 with SSE 4.1.
Note that fptr needs to be 16-byte aligned, since _mm_load_ps requires it. This is not a problem if you compile in 64-bit mode, but in 32-bit mode you should make sure it's aligned.
#include <stdio.h>
#include <stdint.h>
#include <smmintrin.h>
void process(float *fptr, int16_t *sptr, __m128 factor)
{
    __m128 a = _mm_load_ps(fptr);
    __m128 b = _mm_mul_ps(a, factor);
    __m128i c = _mm_cvttps_epi32(b);
    __m128i d = _mm_packs_epi32(c, c);
    _mm_storel_epi64((__m128i*)sptr, d);
}
int main() {
    float x[] = {1.0, 2.0, 3.0, 4.0};
    int16_t y[4];
    __m128 factor = _mm_set1_ps(3.14159f);
    process(x, y, factor);
    printf("%d %d %d %d\n", y[0], y[1], y[2], y[3]);
}
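If you can't guarantee the alignment of fptr, one option is an unaligned load. Here is a minimal sketch, assuming SSE2 is available; process_unaligned is a hypothetical name, and the only change from process() above is _mm_loadu_ps in place of _mm_load_ps:

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical variant of process() that accepts an unaligned fptr. */
void process_unaligned(const float *fptr, int16_t *sptr, __m128 factor)
{
    __m128  a = _mm_loadu_ps(fptr);        /* unaligned load, no 16-byte requirement */
    __m128  b = _mm_mul_ps(a, factor);     /* multiply the four floats */
    __m128i c = _mm_cvttps_epi32(b);       /* convert to int32, truncating toward zero */
    __m128i d = _mm_packs_epi32(c, c);     /* pack to int16 with signed saturation */
    _mm_storel_epi64((__m128i *)sptr, d);  /* store the low 64 bits: four int16 values */
}
```

Alternatively, keep _mm_load_ps and declare the input buffer with 16-byte alignment, for example _Alignas(16) float x[4] in C11 or __declspec(align(16)) in MSVC.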
Note that _mm_cvtps_pi16 is not a simple intrinsic. The Intel Intrinsics Guide says: "This intrinsic creates a sequence of two or more instructions, and may perform worse than a native instruction. Consider the performance impact of this intrinsic."
Here is the assembly output using the MMX version:
mulps (%rdi), %xmm0
roundps $0, %xmm0, %xmm0
movaps %xmm0, %xmm1
cvtps2pi %xmm0, %mm0
movhlps %xmm0, %xmm1
cvtps2pi %xmm1, %mm1
packssdw %mm1, %mm0
movq %mm0, (%rsi)
ret
Here is the assembly output using the SSE-only version:
mulps (%rdi), %xmm0
cvttps2dq %xmm0, %xmm0
packssdw %xmm0, %xmm0
movq %xmm0, (%rsi)
ret
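One thing to be aware of: cvttps2dq truncates toward zero, whereas the code in the question rounds to nearest. If you need rounding, _mm_cvtps_epi32 (cvtps2dq) converts using the current MXCSR rounding mode, which is round-to-nearest-even by default, so a separate _mm_round_ps should not be needed. Here is a sketch, assuming the rounding mode has not been changed; process_round is a hypothetical name:

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical variant of process() that rounds instead of truncating. */
void process_round(const float *fptr, int16_t *sptr, __m128 factor)
{
    __m128  a = _mm_load_ps(fptr);         /* fptr must be 16-byte aligned */
    __m128  b = _mm_mul_ps(a, factor);     /* multiply the four floats */
    __m128i c = _mm_cvtps_epi32(b);        /* convert to int32 using the MXCSR rounding mode */
    __m128i d = _mm_packs_epi32(c, c);     /* pack to int16 with signed saturation */
    _mm_storel_epi64((__m128i *)sptr, d);  /* store the low 64 bits: four int16 values */
}
```

With the inputs from main above, 4.0 * 3.14159 = 12.566 comes out as 13 here and as 12 with the truncating version.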