What series of intrinsics will complete this paeth prediction code?

I have a Paeth Prediction function which operates on arrays:

std::array<std::uint8_t,4> birunji::paeth_prediction
    (const std::array<std::uint8_t,4>& a,
     const std::array<std::uint8_t,4>& b,
     const std::array<std::uint8_t,4>& c)
{
    std::array<std::int16_t,4> pa;
    std::array<std::int16_t,4> pb;
    std::array<std::int16_t,4> pc;

    std::array<std::uint8_t,4> results;

    for(std::size_t i = 0; i < 4; ++i)
    {
        pa[i] = b[i] - c[i];
        pb[i] = a[i] - c[i];
        pc[i] = pa[i] + pb[i];

        pa[i] = std::abs(pa[i]);
        pb[i] = std::abs(pb[i]);
        pc[i] = std::abs(pc[i]);

        if(pa[i] <= pb[i] && pa[i] <= pc[i])
            results[i] = a[i];
        else if(pb[i] <= pc[i])
            results[i] = b[i];
        else
            results[i] = c[i];
    }

    return results;
}

I'm attempting to use intrinsics manually to vectorise the code (for learning purposes).

__m128i birunji::paeth_prediction(const __m128i& a,
                                  const __m128i& b,
                                  const __m128i& c)
{
    __m128i pa = _mm_sub_epi16(b, c);
    __m128i pb = _mm_sub_epi16(a, c);
    __m128i pc = _mm_add_epi16(pa, pb);
    
    pa = _mm_abs_epi16(pa);
    pb = _mm_abs_epi16(pb);
    pc = _mm_abs_epi16(pc);

    __m128i pa_le_pb = _mm_cmpgt_epi16(pb, pa);
    __m128i pa_le_pc = _mm_cmpgt_epi16(pc, pa);
    __m128i pb_le_pc = _mm_cmpgt_epi16(pc, pb);

    return
    _mm_and_si128(_mm_and_si128(pa_le_pb, pa_le_pc),
                  _mm_and_si128(_mm_and_si128(pb_le_pc,b),a));
}

The trouble I'm having is the conditional statements. How do I successfully vectorize these? I'm not sure if my attempt above it correct.

Solution

_mm_cmpgt_epi16 can be used for the comparisons. Note that _mm_cmpgt_epi16(a, b) = !(a <= b), however _mm_cmpgt_epi16(b, a) != (a <= b), because it is not a Greater or Equal comparison but a strict Greater Than comparison. So the masks come out inverted, but that's equally useful in this case, an explicit inversion won't be necessary.

This function should not return a condition itself, it should select from a and b and c according to the conditions. If SSE4.1 is available, _mm_blendv_epi8 can be used to implement that selection. For example (not tested):

__m128i paeth_prediction(__m128i a, __m128i b, __m128i c)
{
    __m128i pa = _mm_sub_epi16(b, c);
    __m128i pb = _mm_sub_epi16(a, c);
    __m128i pc = _mm_add_epi16(pa, pb);
    
    pa = _mm_abs_epi16(pa);
    pb = _mm_abs_epi16(pb);
    pc = _mm_abs_epi16(pc);

    __m128i not_pa_le_pb = _mm_cmpgt_epi16(pa, pb);
    __m128i not_pa_le_pc = _mm_cmpgt_epi16(pa, pc);
    __m128i not_pb_le_pc = _mm_cmpgt_epi16(pb, pc);
    __m128i not_take_a = _mm_or_si128(not_pa_le_pb, not_pa_le_pc);
    __m128i t = _mm_blendv_epi8(b, c, not_pb_le_pc);
    return _mm_blendv_epi8(a, t, not_take_a);
}

The last two lines implement logic like:

if PB is not less-than-or-equal-to PC, take C, otherwise take B.
if PA is not less-than-or-equal-to PB or PA is not less-than-or-equal-to PC, take the result from the previous step, otherwise take A.

Without SSE4.1, the blends could be implemented using AND/ANDNOT/OR.

I've changed the signature of the function so it takes the vectors by value, taking them by const reference is unnecessary (vectors are trivial to copy) and can add overhead from an indirection, though such overhead is likely to be removed if the function ends up being inlined by the compiler.

As a variant, _mm_min_epi16 could be used to implement part of the logic:

__m128i paeth_prediction(__m128i a, __m128i b, __m128i c)
{
    __m128i pa = _mm_sub_epi16(b, c);
    __m128i pb = _mm_sub_epi16(a, c);
    __m128i pc = _mm_add_epi16(pa, pb);
    
    pa = _mm_abs_epi16(pa);
    pb = _mm_abs_epi16(pb);
    pc = _mm_abs_epi16(pc);

    __m128i not_pb_le_pc = _mm_cmpgt_epi16(pb, pc);
    __m128i take_a = _mm_cmpeq_epi16(pa, _mm_min_epi16(pa, _mm_min_epi16(pb, pc)));
    __m128i t = _mm_blendv_epi8(b, c, not_pb_le_pc);
    return _mm_blendv_epi8(t, a, take_a);
}

Because the condition pa <= pb && pa <= pc is equivalent to pa == min(pa, pb, pc).

The resulting assembly code looks a bit better, but I did not test it in any way, including performance.