
How to achieve 8bit madd using SSE2


According to the official Intel C++ Intrinsics Reference, SSE2 has the following intrinsic:

__m128i _mm_madd_epi16(__m128i a, __m128i b)

Multiplies the 8 signed 16-bit integers from a by the 8 signed 16-bit integers from b. Adds the signed 32-bit integer results pairwise and packs the 4 signed 32-bit integer results.

while SSSE3 has

__m128i _mm_maddubs_epi16 (__m128i a, __m128i b)

Multiply signed and unsigned bytes, add horizontal pair of signed words, pack saturated signed words.
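To make the two descriptions above concrete, here is a plain C++ sketch of what each intrinsic computes per lane. The helper names are hypothetical (they are not part of any library), and the code only models the documented semantics. Note that for `_mm_maddubs_epi16` the first operand is the unsigned one and the second is the signed one.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper: scalar model of _mm_madd_epi16.
// 8 signed 16-bit lanes in, 4 signed 32-bit lanes out:
// out[i] = a[2i]*b[2i] + a[2i+1]*b[2i+1]
inline void madd_epi16_scalar(const int16_t a[8], const int16_t b[8], int32_t out[4])
{
    for (int i = 0; i < 4; ++i) {
        // Sum in 64 bits; the only sum that does not fit in 32 bits is the
        // case where all four inputs are -32768, where the hardware wraps.
        int64_t sum = int64_t(a[2 * i]) * b[2 * i] + int64_t(a[2 * i + 1]) * b[2 * i + 1];
        out[i] = static_cast<int32_t>(sum);
    }
}

// Hypothetical helper: scalar model of _mm_maddubs_epi16.
// a holds 16 unsigned bytes, b holds 16 signed bytes, out gets 8 signed
// 16-bit lanes, each saturated to [-32768, 32767].
inline void maddubs_epi16_scalar(const uint8_t a[16], const int8_t b[16], int16_t out[8])
{
    for (int i = 0; i < 8; ++i) {
        int32_t sum = int32_t(a[2 * i]) * b[2 * i] + int32_t(a[2 * i + 1]) * b[2 * i + 1];
        sum = std::min<int32_t>(32767, std::max<int32_t>(-32768, sum));
        out[i] = static_cast<int16_t>(sum);
    }
}
```

The saturation step matters for 8-bit inputs: two products of 255 * 127 already add up to 64770, which does not fit in a signed 16-bit lane.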

Since I'm working with 8-bit pixels and must use only SSE2 (the target is an old architecture), I need an 8-bit madd instruction. How would I proceed?


Solution

  • Hope this works - I don't have a compiler here. But even if I missed something, you should get the overall idea.

    EDIT: Thanks to @Peter Cordes for pointing out that it is better to call _mm_setzero_si128 directly.

    #include <emmintrin.h> // SSE2 intrinsics

    inline __m128i _mm_madd_epi8_SSE2(const __m128i & a, const __m128i & b)
    {
        // a = 0x00 0x01 0xFE 0x04 ...
        // b = 0x00 0x02 0x80 0x84 ...
    
        // To sign-extend an 8-bit value to 16 bits, the upper byte must be 0xFF for negatives and 0x00 otherwise
        __m128i sign_mask_a  = _mm_cmplt_epi8(a, _mm_setzero_si128());
        __m128i sign_mask_b  = _mm_cmplt_epi8(b, _mm_setzero_si128());
    
        // sign_mask_a = 0x00 0x00 0xFF 0x00 ...
        // sign_mask_b = 0x00 0x00 0xFF 0xFF ...
    
        // Unpack positives with 0x00, negatives with 0xFF
        __m128i a_epi16_l    = _mm_unpacklo_epi8(a, sign_mask_a);
        __m128i a_epi16_h    = _mm_unpackhi_epi8(a, sign_mask_a);
        __m128i b_epi16_l    = _mm_unpacklo_epi8(b, sign_mask_b);
        __m128i b_epi16_h    = _mm_unpackhi_epi8(b, sign_mask_b);
    
        // Here - valid 16-bit signed integers corresponding to the 8-bit input
        // a_epi16_l = 0x00 0x00 0x01 0x00 0xFE 0xFF 0x04 0x00 ... 
    
        // Get the a[i] * b[i] + a[i+1] * b[i+1] for both low and high parts
        __m128i madd_epi32_l = _mm_madd_epi16(a_epi16_l, b_epi16_l);
        __m128i madd_epi32_h = _mm_madd_epi16(a_epi16_h, b_epi16_h);
    
        // Now go back from 32-bit values to 16-bit values & signed saturate
        return _mm_packs_epi32(madd_epi32_l, madd_epi32_h);
    }
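The function above treats both inputs as signed bytes. Since the question is about 8-bit pixels, which are usually unsigned, a variant closer to `_mm_maddubs_epi16` may be what is actually needed: zero-extend the pixels, sign-extend the coefficients, then reuse `_mm_madd_epi16`. The following is a minimal sketch under that assumption; the function name is hypothetical, and it has not been benchmarked.

```cpp
#include <emmintrin.h> // SSE2 intrinsics
#include <cstdint>

// Hypothetical name: SSE2-only sketch of the SSSE3 _mm_maddubs_epi16 semantics.
// a = 16 unsigned bytes (e.g. pixels), b = 16 signed bytes (e.g. filter taps).
inline __m128i maddubs_epi16_sse2(__m128i a, __m128i b)
{
    const __m128i zero = _mm_setzero_si128();

    // Unsigned bytes: zero-extend to 16 bits by interleaving with 0x00
    __m128i a_lo = _mm_unpacklo_epi8(a, zero);
    __m128i a_hi = _mm_unpackhi_epi8(a, zero);

    // Signed bytes: sign-extend by interleaving with 0xFF where b < 0
    __m128i sign_b = _mm_cmplt_epi8(b, zero);
    __m128i b_lo = _mm_unpacklo_epi8(b, sign_b);
    __m128i b_hi = _mm_unpackhi_epi8(b, sign_b);

    // a[i]*b[i] + a[i+1]*b[i+1] as 32-bit sums for the low and high halves
    __m128i sum_lo = _mm_madd_epi16(a_lo, b_lo);
    __m128i sum_hi = _mm_madd_epi16(a_hi, b_hi);

    // Narrow 32-bit sums to 16 bits with signed saturation
    return _mm_packs_epi32(sum_lo, sum_hi);
}
```

A typical call would load 16 pixels and 16 coefficients with `_mm_loadu_si128` and store the eight 16-bit results with `_mm_storeu_si128`. The final pack has to be `_mm_packs_epi32`, because the intermediate sums are 32-bit lanes.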