SSE2 intrinsics - comparing unsigned integers


I'm interested in identifying overflowing values when adding unsigned 8-bit integers, and clamping the result to 0xFF:

__m128i m1 = _mm_loadu_si128(/* 16 8-bit unsigned integers */);
__m128i m2 = _mm_loadu_si128(/* 16 8-bit unsigned integers */);

__m128i m3 = _mm_adds_epu8(m1, m2);

I would like to perform a "less than" comparison on these unsigned integers, similar to what _mm_cmplt_epi8 does for signed ones:

__m128i mask = _mm_cmplt_epi8 (m3, m1);
m1 = _mm_or_si128(m3, mask);

If an "epu8" equivalent were available, mask would hold 0xFF where m3[i] < m1[i] (overflow!) and 0x00 otherwise. We could then clamp using the "or", so that m1 holds the addition result where it is valid and 0xFF where it overflowed.

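The problem is that _mm_cmplt_epi8 performs a signed comparison. For instance, if m1[i] = 0x70 and m2[i] = 0x10, then m3[i] = 0x80, which as a signed byte is -128 and therefore compares less than 0x70 (112). The result is mask[i] = 0xFF even though no overflow occurred, which is obviously not what I require.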
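
To make the failing case concrete, here is a scalar model of a single 8-bit lane. It is only a sketch: the helper names lane_cmplt_signed and lane_cmplt_unsigned are hypothetical and not part of any intrinsics header, and the cast to int8_t assumes the usual two's-complement behaviour of x86 compilers.

```c
#include <stdint.h>

// Scalar model of one lane of _mm_cmplt_epi8: the bytes are compared
// as signed values, so 0x80..0xFF count as negative numbers.
static inline uint8_t lane_cmplt_signed(uint8_t a, uint8_t b)
{
    return ((int8_t)a < (int8_t)b) ? 0xFF : 0x00;
}

// Scalar model of the wanted comparison: the bytes are compared
// as unsigned values in the range 0..255.
static inline uint8_t lane_cmplt_unsigned(uint8_t a, uint8_t b)
{
    return (a < b) ? 0xFF : 0x00;
}
```

With the values from the example, lane_cmplt_signed(0x80, 0x70) gives 0xFF, while lane_cmplt_unsigned(0x80, 0x70) gives 0x00.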

Using VS2012.

I would appreciate another approach for performing this. Thanks!


Solution

  • One way to implement comparisons for unsigned 8-bit vectors is to exploit _mm_max_epu8, which returns the element-wise maximum of unsigned 8-bit integers. Since max(a, b) equals a exactly when a >= b, comparing the unsigned maximum for equality with one of the source operands gives >= (or <= with the operands swapped), and inverting that mask gives < or >. This translates to 2 instructions for >= or <=, and 3 instructions for > or <.

    Example code:

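    #include <stdio.h>
    #include <stdint.h>
    #include <emmintrin.h>    // SSE2
    
    // a >= b (unsigned) holds exactly when max(a, b) == a
    #define _mm_cmpge_epu8(a, b) \
            _mm_cmpeq_epi8(_mm_max_epu8(a, b), a)
    
    // a <= b is b >= a
    #define _mm_cmple_epu8(a, b) _mm_cmpge_epu8(b, a)
    
    // a > b is NOT (a <= b): XOR with all-ones inverts the mask
    #define _mm_cmpgt_epu8(a, b) \
            _mm_xor_si128(_mm_cmple_epu8(a, b), _mm_set1_epi8(-1))
    
    // a < b is b > a
    #define _mm_cmplt_epu8(a, b) _mm_cmpgt_epu8(b, a)
    
    // Print the 16 unsigned bytes of v (portable stand-in for the
    // non-standard "%vhhu" vector conversion, which few compilers support)
    static void print_epu8(const char *name, __m128i v)
    {
        uint8_t e[16];
        _mm_storeu_si128((__m128i *)e, v);
        printf("%-4s =", name);
        for (int i = 0; i < 16; ++i)
            printf(" %4u", (unsigned)e[i]);
        printf("\n");
    }
    
    int main(void)
    {
        __m128i va = _mm_setr_epi8(0,   0,   1,   1,   1, 127, 127, 127, 128, 128, 128, 254, 254, 254, 255, 255);
        __m128i vb = _mm_setr_epi8(0, 255,   0,   1, 255,   0, 127, 255,   0, 128, 255,   0, 254, 255,   0, 255);
    
        __m128i v_ge = _mm_cmpge_epu8(va, vb);
        __m128i v_le = _mm_cmple_epu8(va, vb);
        __m128i v_gt = _mm_cmpgt_epu8(va, vb);
        __m128i v_lt = _mm_cmplt_epu8(va, vb);
    
        print_epu8("va", va);
        print_epu8("vb", vb);
        print_epu8("v_ge", v_ge);
        print_epu8("v_le", v_le);
        print_epu8("v_gt", v_gt);
        print_epu8("v_lt", v_lt);
    
        return 0;
    }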
    

    Compile and run:

    $ gcc -Wall _mm_cmplt_epu8.c && ./a.out 
    va   =    0    0    1    1    1  127  127  127  128  128  128  254  254  254  255  255
    vb   =    0  255    0    1  255    0  127  255    0  128  255    0  254  255    0  255
    v_ge =  255    0  255  255    0  255  255    0  255  255    0  255  255    0  255  255
    v_le =  255  255    0  255  255    0  255  255    0  255  255    0  255  255    0  255
    v_gt =    0    0  255    0    0  255    0    0  255    0    0  255    0    0  255    0
    v_lt =    0  255    0    0  255    0    0  255    0    0  255    0    0  255    0    0
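
Applied to the clamping problem from the question, the unsigned comparison could be used roughly as follows. This is a minimal sketch under stated assumptions: the helper names cmplt_epu8 and add_clamp_epu8 are hypothetical, and the addition is done with the wrapping _mm_add_epi8 so that an overflowed lane really is smaller than its input. Note that _mm_adds_epu8 already saturates unsigned 8-bit sums to 0xFF on its own, so the explicit mask is mainly useful when the positions of the overflowing lanes are themselves of interest.

```c
#include <emmintrin.h>    // SSE2

// Unsigned a < b per 8-bit lane: 0xFF where true, 0x00 otherwise.
// a < b  <=>  NOT (max(a, b) == a)
static inline __m128i cmplt_epu8(__m128i a, __m128i b)
{
    __m128i ge = _mm_cmpeq_epi8(_mm_max_epu8(a, b), a);   // a >= b
    return _mm_xor_si128(ge, _mm_set1_epi8(-1));           // invert the mask
}

// Adds 16 unsigned bytes, clamps overflowed lanes to 0xFF, and reports
// which lanes overflowed through *overflow_mask if that pointer is non-null.
static inline __m128i add_clamp_epu8(__m128i a, __m128i b,
                                     __m128i *overflow_mask)
{
    __m128i sum  = _mm_add_epi8(a, b);     // wrapping add (modulo 256)
    __m128i mask = cmplt_epu8(sum, a);     // sum < a means the lane wrapped
    if (overflow_mask)
        *overflow_mask = mask;
    return _mm_or_si128(sum, mask);        // force wrapped lanes to 0xFF
}
```

For the lane values from the question (0x70 + 0x10), the mask lane is 0x00 and the result is 0x80. For 0xF0 + 0x20, the mask lane is 0xFF and the result is clamped to 0xFF.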