c++arrays performance comparison intel-ipp

High performance comparison of signed int arrays (using Intel IPP library)

We're trying to compare two equally sized native arrays of signed int values using inequality operations, <, <=, > and >=, in a high performance way. As many values are compared, the true/false results would be sotred in a char array of the same size of the input, where 0x00 means false and 0xff means true.

To accomplish this, we're using the Intel IPP library. The problem is that the function we found that does this operation, named ippiCompare_*, from the images and video processing lib, supports only the types unsigned char (Ipp8u), signed/unsigned short (Ipp16s/Ipp16u) and float (Ipp32f). It does not directly support signed int (Ipp32s)

I (only) envision two possible ways of solving this:

Casting the array to one of the directly supported types and executing the comparison in more steps (it would became a short array of twice the size or a char array of four times the size) and merging intermediate results.
Using another function directly supporting signed int arrays from IPP or from another library that could do something equivalent in terms of performance.

But there may be other creative ways... So I ask you're help with that! :)

PS: The advantage of using Intel IPP is the performance gain for large arrays: it uses multi-value processor functions and many cores simultaneously (and maybe more tricks). So simple looped solutions wouldn't do it as fast AFAIK.

PS2: link for the ippiCompare_* doc

Solution

You could do the comparison with PCMPEQD followed by a PACKUSDW and PACKUSWB. This would be something along

#include <emmintrin.h>

void cmp(__m128d* a, __m128d* b, v16qi* result, unsigned count) {
    for (unsigned i=0; i < count/16; ++i) {
        __m128d result0 = _mm_cmpeq_pd(a[0], b[0]);  // each line compares 4 integers
        __m128d result1 = _mm_cmpeq_pd(a[1], b[1]);
        __m128d result2 = _mm_cmpeq_pd(a[2], b[2]);
        __m128d result3 = _mm_cmpeq_pd(a[3], b[3]);
        a += 4; b+= 4;

        v8hi wresult0 = __builtin_ia32_packssdw(result0, result1);  //pack 2*4 integer results into 8 words
        v8hi wresult1 = __builtin_ia32_packssdw(result0, result1);

        *result = __builtin_ia32_packsswb(wresult0, wresult1);  //pack 2*8 word results into 16 bytes
        result++;
    }
}

Needs aligned pointers, a count divisible by 16, some typecasts I have omitted because of lazyness/stupidity and probably a lot of debugging, of course. And I didn't find the intrinsics for packssdw/wb, so I just used the builtins from my compiler.