
Simulating packusdw functionality with SSE2


I'm implementing a fast x888 -> 565 pixel conversion function in pixman according to the algorithm described by Intel [pdf]. Their code converts x888 -> 555, while I want to convert to 565. Unfortunately, a 565 pixel can have its high bit (bit 15) set, which means I can't use the signed-saturation pack instructions: they would clamp such values to 0x7FFF. The unsigned pack instruction, packusdw, wasn't added until SSE4.1. I'd like to implement its functionality with SSE2 or find another way of doing this.

This function takes two XMM registers containing four 32-bit pixels each and returns a single XMM register containing the eight converted RGB565 pixels.

static force_inline __m128i
pack_565_2packedx128_128 (__m128i lo, __m128i hi)
{
    __m128i rb0 = _mm_and_si128 (lo, mask_565_rb);
    __m128i rb1 = _mm_and_si128 (hi, mask_565_rb);

    __m128i t0 = _mm_madd_epi16 (rb0, mask_565_pack_multiplier);
    __m128i t1 = _mm_madd_epi16 (rb1, mask_565_pack_multiplier);

    __m128i g0 = _mm_and_si128 (lo, mask_green);
    __m128i g1 = _mm_and_si128 (hi, mask_green);

    t0 = _mm_or_si128 (t0, g0);
    t1 = _mm_or_si128 (t1, g1);

    t0 = _mm_srli_epi32 (t0, 5);
    t1 = _mm_srli_epi32 (t1, 5);

    /* XXX: maybe there's a way to do this relatively efficiently with SSE2? */
    return _mm_packus_epi32 (t0, t1);
}

Ideas I've thought of:

  • Subtracting 0x8000, _mm_packs_epi32, re-adding 0x8000 to each 565 pixel. I've tried this, but I can't make this work.

      t0 = _mm_sub_epi16 (t0, mask_8000);
      t1 = _mm_sub_epi16 (t1, mask_8000);
      t0 = _mm_packs_epi32 (t0, t1);
      return _mm_add_epi16 (t0, mask_8000);
    
  • Shuffle data instead of packing it. Works for MMX, but since SSE 16-bit shuffles work on only the high or low 64 bits, it would get messy.

  • Save high bits, set them to zero, do the pack, restore them afterwards. Seems quite messy.
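
A note on the first idea: one likely reason the attempt above fails is that _mm_sub_epi16 works on 16-bit lanes, while the values still sit in 32-bit lanes at that point, so the upper half of each lane is not borrowed from correctly. The sketch below uses a 32-bit subtract before the pack and a 16-bit add after it. The function name and the two bias constants are hypothetical (the definition of mask_8000 is not shown here), so treat this as an illustration rather than the pixman code.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: emulates _mm_packus_epi32 (packusdw) with SSE2 only.
 * The inputs are biased into the signed 16-bit range, packed with signed
 * saturation, and the bias is removed again in 16-bit lanes. */
static inline __m128i
packus_epi32_sse2_bias (__m128i lo, __m128i hi)
{
    /* Assumed constants: 0x8000 per 32-bit lane and per 16-bit lane. */
    const __m128i bias32 = _mm_set1_epi32 (0x8000);
    const __m128i bias16 = _mm_set1_epi16 ((short) -32768);

    lo = _mm_sub_epi32 (lo, bias32);    /* 0..0xFFFF -> -32768..32767 */
    hi = _mm_sub_epi32 (hi, bias32);
    lo = _mm_packs_epi32 (lo, hi);      /* in-range values are not clamped */
    return _mm_add_epi16 (lo, bias16);  /* undo the bias, wrapping mod 2^16 */
}
```

For inputs already in 0..0xFFFF this simply keeps the low 16 bits. Inputs outside that range are clamped to 0 or 0xFFFF, as with packusdw, provided the 32-bit subtraction itself does not overflow.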

Is there some other (hopefully more efficient) way I could do this?


Solution

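  • You could sign-extend the values first and then use _mm_packs_epi32. Each value fits in 16 bits, so after sign extension it lies within the signed 16-bit range and the saturating pack leaves its low 16 bits unchanged: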

    t0 = _mm_slli_epi32 (t0, 16);
    t0 = _mm_srai_epi32 (t0, 16);
    t1 = _mm_slli_epi32 (t1, 16);
    t1 = _mm_srai_epi32 (t1, 16);
    t0 = _mm_packs_epi32 (t0, t1);
    

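    You could actually combine this with the previous shifts to save two instructions. The left shift absorbs the earlier right shift by 5, so these lines also replace the two _mm_srli_epi32 (..., 5) calls: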

    t0 = _mm_slli_epi32 (t0, 16 - 5);
    t0 = _mm_srai_epi32 (t0, 16);
    t1 = _mm_slli_epi32 (t1, 16 - 5);
    t1 = _mm_srai_epi32 (t1, 16);
    t0 = _mm_packs_epi32 (t0, t1);
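
Putting the combined shifts back into the function from the question gives something like the sketch below. The three mask constants are not shown in the question; the values used here (0x00F800F8, 0x0000FC00, 0x20000004) are assumptions chosen to be consistent with the shifts in the code, and the _sse2 suffix and the scalar reference function are hypothetical names added for checking the result.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Scalar reference: x8r8g8b8 -> r5g6b5. */
static uint16_t
convert_8888_to_0565_scalar (uint32_t p)
{
    return (uint16_t) (((p >> 8) & 0xF800) |   /* top 5 bits of red   */
                       ((p >> 5) & 0x07E0) |   /* top 6 bits of green */
                       ((p >> 3) & 0x001F));   /* top 5 bits of blue  */
}

/* SSE2-only variant of the function from the question. */
static inline __m128i
pack_565_2packedx128_128_sse2 (__m128i lo, __m128i hi)
{
    /* Assumed constant values, see the note above. */
    const __m128i mask_565_rb = _mm_set1_epi32 (0x00F800F8);
    const __m128i mask_green = _mm_set1_epi32 (0x0000FC00);
    const __m128i mask_565_pack_multiplier = _mm_set1_epi32 (0x20000004);

    __m128i rb0 = _mm_and_si128 (lo, mask_565_rb);
    __m128i rb1 = _mm_and_si128 (hi, mask_565_rb);

    /* blue * 4 lands in bits 5..9, red * 0x2000 lands in bits 16..20 */
    __m128i t0 = _mm_madd_epi16 (rb0, mask_565_pack_multiplier);
    __m128i t1 = _mm_madd_epi16 (rb1, mask_565_pack_multiplier);

    /* green already sits in bits 10..15 */
    t0 = _mm_or_si128 (t0, _mm_and_si128 (lo, mask_green));
    t1 = _mm_or_si128 (t1, _mm_and_si128 (hi, mask_green));

    /* Move bits 5..20 up to bits 16..31, then shift back arithmetically so
     * each lane holds the 565 value sign-extended from 16 bits. */
    t0 = _mm_srai_epi32 (_mm_slli_epi32 (t0, 16 - 5), 16);
    t1 = _mm_srai_epi32 (_mm_slli_epi32 (t1, 16 - 5), 16);

    /* Signed saturation cannot clamp values in -32768..32767. */
    return _mm_packs_epi32 (t0, t1);
}
```

Comparing the eight output lanes against convert_8888_to_0565_scalar for a handful of pixels is a quick way to check the constants and shift counts before relying on this.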