c++ sse intrinsics sse2

Store four 16-bit integers with SSE intrinsics


I multiply and round four 32-bit floats, then convert them to four 16-bit integers with SSE intrinsics. I'd like to store the four integer results to an array. With floats it's easy: _mm_store_ps(float_ptr, m128value). However, I haven't found any instruction to do this with 16-bit (__m64) integers.

void process(float *fptr, int16_t *sptr, __m128 factor)
{
  __m128 a = _mm_load_ps(fptr);
  __m128 b = _mm_mul_ps(a, factor);
  __m128 c = _mm_round_ps(b, _MM_FROUND_TO_NEAREST_INT);
  __m64 s =_mm_cvtps_pi16(c);
  // now store the values to sptr
}

Any help would be appreciated.


Solution

  • Personally, I would avoid using MMX. Also, I would use an explicit store rather than an implicit one, which often only works on certain compilers. The following code works fine in MSVC2012 with SSE 4.1.

    Note that fptr needs to be 16-byte aligned, because _mm_load_ps is an aligned load. This is not a problem if you compile in 64-bit mode, but in 32-bit mode you should make sure it's aligned.

    #include <stdio.h>
    #include <stdint.h>
    #include <smmintrin.h>
    
    void process(float *fptr, int16_t *sptr, __m128 factor)
    {
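      __m128 a = _mm_load_ps(fptr);          // aligned load of four floats
      __m128 b = _mm_mul_ps(a, factor);
      __m128i c = _mm_cvttps_epi32(b);       // float -> int32, truncating toward zero
      __m128i d = _mm_packs_epi32(c,c);      // int32 -> int16 with signed saturation
      _mm_storel_epi64((__m128i*)sptr, d);   // store the low 64 bits (four int16_t)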
    }
    
    int main() {
        float x[] = {1.0, 2.0, 3.0, 4.0};
        int16_t y[4];
        __m128 factor = _mm_set1_ps(3.14159f);
        process(x, y, factor);
        printf("%d %d %d %d\n", y[0], y[1], y[2], y[3]);
    }
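
The question rounds to nearest before converting, while _mm_cvttps_epi32 truncates toward zero. A minimal sketch of a rounding variant follows, assuming the default MXCSR rounding mode (round to nearest even) is in effect. It swaps in _mm_cvtps_epi32 and uses an unaligned load so the input does not have to be 16-byte aligned. The name process_rounded is hypothetical and not part of the original answer.

```cpp
#include <stdint.h>
#include <emmintrin.h>  // SSE2 covers every intrinsic used here

// Hypothetical variant of process(): rounds instead of truncating.
// Assumes the MXCSR rounding mode is the default (round to nearest even).
void process_rounded(const float *fptr, int16_t *sptr, __m128 factor)
{
  __m128 a = _mm_loadu_ps(fptr);         // unaligned load, no alignment requirement
  __m128 b = _mm_mul_ps(a, factor);
  __m128i c = _mm_cvtps_epi32(b);        // float -> int32 using the current rounding mode
  __m128i d = _mm_packs_epi32(c, c);     // int32 -> int16 with signed saturation
  _mm_storel_epi64((__m128i*)sptr, d);   // store the low 64 bits (four int16_t)
}
```

With the inputs from main above (x = {1, 2, 3, 4} and a factor of 3.14159f), this variant should store 3 6 9 13, whereas the truncating version stores 3 6 9 12.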
    

    Note that _mm_cvtps_pi16 is not a simple intrinsic. The Intel Intrinsics Guide says: "This intrinsic creates a sequence of two or more instructions, and may perform worse than a native instruction. Consider the performance impact of this intrinsic."

    Here is the assembly output using the MMX version:

    mulps   (%rdi), %xmm0
    roundps $0, %xmm0, %xmm0
    movaps  %xmm0, %xmm1
    cvtps2pi    %xmm0, %mm0
    movhlps %xmm0, %xmm1
    cvtps2pi    %xmm1, %mm1
    packssdw    %mm1, %mm0
    movq    %mm0, (%rsi)
    ret
    

    Here is the assembly output using the SSE-only version:

    mulps   (%rdi), %xmm0
    cvttps2dq   %xmm0, %xmm0
    packssdw    %xmm0, %xmm0
    movq    %xmm0, (%rsi)
    ret
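
In the SSE version, packssdw packs the register with itself, so only half of the packed result is used. If the input holds at least eight floats, one possible extension is to convert two vectors and pack them into a single 128-bit store. This is only a sketch: process8 is a hypothetical name, and it uses unaligned loads and stores on the assumption that neither pointer is guaranteed to be 16-byte aligned.

```cpp
#include <stdint.h>
#include <emmintrin.h>  // SSE2

// Hypothetical helper: scales 8 floats and stores 8 int16_t per call.
// Truncates toward zero, like the 4-element version above.
void process8(const float *fptr, int16_t *sptr, __m128 factor)
{
  __m128i lo = _mm_cvttps_epi32(_mm_mul_ps(_mm_loadu_ps(fptr), factor));
  __m128i hi = _mm_cvttps_epi32(_mm_mul_ps(_mm_loadu_ps(fptr + 4), factor));
  // lo fills 16-bit elements 0..3, hi fills elements 4..7, both with signed saturation
  __m128i packed = _mm_packs_epi32(lo, hi);
  _mm_storeu_si128((__m128i*)sptr, packed);  // unaligned 128-bit store
}
```

Values outside the int16_t range are clamped to -32768 or 32767 by the saturating pack rather than wrapping around.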