SSE Instructions: Byte+Short


I have very long byte arrays that need to be added to a destination array of type short (or int). Is there an SSE instruction for this, or a set of instructions that does it?


Solution

  • You need to unpack each vector of 8 bit values to two vectors of 16 bit values and then add those.

    __m128i v = _mm_set_epi8(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);
    __m128i vl = _mm_unpacklo_epi8(v, _mm_set1_epi8(0)); // vl = { 7, 6, 5, 4, 3, 2, 1, 0 }
    __m128i vh = _mm_unpackhi_epi8(v, _mm_set1_epi8(0)); // vh = { 15, 14, 13, 12, 11, 10, 9, 8 }
    

    where v is a vector of 16 x 8 bit values and vl, vh are the two unpacked vectors of 8 x 16 bit values.
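
    As a rough sketch of the "unpack, then add" step applied to the arrays from the question, the bytes can be widened and added to a 16 bit destination eight elements at a time. The function name add_u8_to_u16 and its parameters are hypothetical, the bytes are assumed to be unsigned, and unaligned loads and stores are used so that no alignment has to be assumed:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: dst[i] += src[i] for n elements.
   Assumes src holds unsigned bytes; the 16 bit adds wrap on overflow. */
static void add_u8_to_u16(uint16_t *dst, const uint8_t *src, size_t n)
{
    const __m128i zero = _mm_setzero_si128();
    size_t i = 0;

    for (; i + 16 <= n; i += 16)
    {
        /* load 16 bytes and zero-extend them to 2 x 8 16 bit values */
        __m128i v  = _mm_loadu_si128((const __m128i *)(src + i));
        __m128i vl = _mm_unpacklo_epi8(v, zero);   /* src[i+0 .. i+7]  */
        __m128i vh = _mm_unpackhi_epi8(v, zero);   /* src[i+8 .. i+15] */

        /* load the matching 16 destination elements, add, store back */
        __m128i dl = _mm_loadu_si128((const __m128i *)(dst + i));
        __m128i dh = _mm_loadu_si128((const __m128i *)(dst + i + 8));
        _mm_storeu_si128((__m128i *)(dst + i),     _mm_add_epi16(dl, vl));
        _mm_storeu_si128((__m128i *)(dst + i + 8), _mm_add_epi16(dh, vh));
    }

    /* scalar tail for the last n % 16 elements */
    for (; i < n; ++i)
        dst[i] = (uint16_t)(dst[i] + src[i]);
}
```

    If both arrays are known to be 16 byte aligned, the aligned _mm_load_si128 / _mm_store_si128 variants could be used instead.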

    Note that I'm assuming that the 8 bit values are unsigned so when unpacking to 16 bits the high byte is set to 0 (i.e. no sign extension).
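
    If the bytes were signed instead, the high byte would have to be a copy of the sign bit rather than 0. One possible SSE2 way to get that, shown here only as a sketch with a hypothetical helper name, is to build a mask of the negative bytes with _mm_cmpgt_epi8 and unpack against that mask instead of against zero:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: sign-extend 16 signed bytes in v
   to two vectors of 8 x 16 bit values (*lo and *hi). */
static void widen_i8_to_i16(__m128i v, __m128i *lo, __m128i *hi)
{
    /* sign byte is 0xFF where the input byte is negative, 0x00 otherwise */
    __m128i sign = _mm_cmpgt_epi8(_mm_setzero_si128(), v);

    /* interleave each value byte with its sign byte as the high byte */
    *lo = _mm_unpacklo_epi8(v, sign);
    *hi = _mm_unpackhi_epi8(v, sign);
}
```

    The widened vectors can then be added with _mm_add_epi16 exactly as in the unsigned case.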

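    If you want to sum a lot of these vectors and get a 32 bit result, a useful trick is to use _mm_madd_epi16 with a multiplier of 1: it multiplies the 16 bit elements and adds adjacent pairs of products into 32 bit elements, so each call widens and partially sums in one step, e.g.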

    __m128i vsuml = _mm_set1_epi32(0);
    __m128i vsumh = _mm_set1_epi32(0);
    __m128i vsum;
    int sum;
    
    for (int i = 0; i < N; i += 16)
    {
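        // x is assumed to be a 16 byte aligned array of unsigned bytes and N a multiple of 16
        __m128i v = _mm_load_si128((const __m128i *)&x[i]);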
        __m128i vl = _mm_unpacklo_epi8(v, _mm_set1_epi8(0));
        __m128i vh = _mm_unpackhi_epi8(v, _mm_set1_epi8(0));
        vsuml = _mm_add_epi32(vsuml, _mm_madd_epi16(vl, _mm_set1_epi16(1)));
        vsumh = _mm_add_epi32(vsumh, _mm_madd_epi16(vh, _mm_set1_epi16(1)));
    }
    // do horizontal sum of 4 partial sums and store in scalar int
    vsum = _mm_add_epi32(vsuml, vsumh);
    vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 8));
    vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 4));
    sum = _mm_cvtsi128_si32(vsum);
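
    The loop above relies on x being 16 byte aligned and N being a multiple of 16. A minimal self-contained sketch that drops both assumptions might look like the following; the function name sum_u8 is hypothetical, unaligned loads are used, and a scalar loop handles the leftover bytes:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: sum of n unsigned bytes.
   The result wraps modulo 2^32 for very large totals. */
static uint32_t sum_u8(const uint8_t *x, size_t n)
{
    const __m128i zero = _mm_setzero_si128();
    const __m128i ones = _mm_set1_epi16(1);
    __m128i vsum = zero;            /* 4 x 32 bit partial sums */
    size_t i = 0;

    for (; i + 16 <= n; i += 16)
    {
        __m128i v  = _mm_loadu_si128((const __m128i *)(x + i));
        __m128i vl = _mm_unpacklo_epi8(v, zero);
        __m128i vh = _mm_unpackhi_epi8(v, zero);
        /* madd with 1 adds adjacent 16 bit pairs into 32 bit lanes */
        vsum = _mm_add_epi32(vsum, _mm_madd_epi16(vl, ones));
        vsum = _mm_add_epi32(vsum, _mm_madd_epi16(vh, ones));
    }

    /* horizontal sum of the 4 partial sums */
    vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 8));
    vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 4));
    uint32_t sum = (uint32_t)_mm_cvtsi128_si32(vsum);

    /* scalar tail for the last n % 16 bytes */
    for (; i < n; ++i)
        sum += x[i];

    return sum;
}
```

    A single accumulator is used here for brevity; keeping two, as in the loop above, is equally valid.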