C/C++ fast absolute difference between two series

i am interested in generating efficient c/c++ code to get the differences between two time series. More precise: The time series values are stored as uint16_t arrays with fixed and equal length == 128.

I am good with a pure c as well as a pure c++ implementation. My code examples are in c++

My intentions are:

Let A,B and C be discrete time series of length l with a value-type of uint16_t.
Vn[n<l]: Cn = |An - Bn|;

What i can think of in pseudo code:

for index i:
 if a[i] > b[i]:
    c[i] = a[i] - b[i]
 else:
    c[i] = b[i] - a[i]

Or in c/c++

for(uint8_t idx = 0; idx < 128; idx++){
    c[i] = a[i] > b[i] ? a[i] - b[i] : b[i] - a[i];
}

But i really dont like the if/else statement in the loop. I am okay with looping - this can be unrolled by the compiler. Somewhat like:

void getBufDiff(const uint16_t (&a)[], const uint16_t (&b)[], uint16_t (&c)[]) {
#pragma unroll 16
    for (uint8_t i = 0; i < 128; i++) {
        c[i] = a[i] > b[i] ? a[i] - b[i] : b[i] - a[i];
    }
#end pragma
}

What i am looking for is a 'magic code' which speeds up the if/else and gets me the absolute difference between the two unsigned values.

I am okay with a +/- 1 precision (In case this would allow some bit-magic to happen). I am also okay with changing the data-type to get faster results. And i am also okay with dropping the loop for something else.

So something like:

void getBufDiff(const uint16_t (&a)[], const uint16_t (&b)[], uint16_t (&c)[]) {
#pragma unroll 16
    for (uint8_t i = 0; i < 128; i++) {
        c[i] = magic_code_for_abs_diff(a[i],b[i]);
    }
#end pragma
}

Did try XORing the two values. Gives proper results only for one of the cases.

EDIT 2:

Did a quick test on different approaches on my Laptop.

For 250000000 entrys this is the performance (256 rounds):

c[i] = a[i] > b[i] ? a[i] - b[i] : b[i] - a[i];  ~500ms
c[i] = std::abs(a[i] - b[i]);                    ~800ms
c[i] = ((a[i] - b[i]) + ((a[i] - b[i]) >> 15)) ^ (i >> 15) ~425ms
uint16_t tmp = (a[i] - b[i]); c[i] = tmp * ((tmp > 0) - (tmp < 0)); ~600ms
uint16_t ret[2] = { a[i] - b[i], b[i] - a[i] };c[i] = ret[a[i] < b[i]] ~900ms
c[i] = ((a[i] - b[i]) >> 31 | 1) * (a[i] - b[i]); ~375ms
c[i] = ((a[i] - b[i])) ^ ((a[i] - b[i]) >> 15) ~425ms

Solution

Try to let the compiler see the conditional lane-selection pattern for SIMD instructions like this (pseudo code):

// store a,b to SIMD registers
for(0 to 32)
   a[...] = input[...]
   b[...] = input2[...]

// single type operation, easily parallelizable
for(0 to 32)
   vector1[...] = a[...] - b[...]

// single type operation, easily parallelizable
// maybe better to compute b-a to decrease dependency to first step
// since a and b are already in SIMD registers
for(0 to 32)
   vector2[...] = -vector1[...]

// single type operation, easily parallelizable
// re-use a,b registers, again
for(0 to 32)
   vector3[...] = a[...] < b[...]

// x84 architecture has SIMD instructions for this
// operation is simple, no other calculations inside, just 3 inputs, 1 out
// all operands are registers (at least should be, if compiler works fine)
for(0 to 32)
   vector4[...] = vector3[...] ? vector2[...]:vetor1[...]

If you write your benchmark codes, I can compare this with other solutions. But it wouldn't matter for good compilers (or good compiler flags) that do same thing automatically for the first benchmark code in question.