Search code examples
c++assemblyoptimizationsimdintrinsics

How to optimize a test to check if std::array<float, 4> contains an out of range value?


I have a 4D vector: std::array<float, 4>

I want to check if all it's components are inside the value range: 0.0f <= X && X < 256.0f

How do I check if any of the vector components are outside of this range? All I need is a single bool to know if that whole vector passes or fails the test.

My first attempt to solve this was the following piece of code:

bool Check_If_Outside_2(std::array<float, 4> vec)
{
    bool outside = false;

    for (int i = 0; i < 4; i++)
        if (vec[i] < VType(0) || vec[i] >= VType(256))
            outside = true;

    return outside;
}

Which resulted in this assembler output:

Check_If_Outside_2(std::array<float, 4ul>):  # @Check_If_Outside_2(std::array<float, 4ul>)
        vxorps  xmm2, xmm2, xmm2
        vucomiss        xmm0, xmm2
        setb    al
        vmovss  xmm3, dword ptr [rip + .LCPI7_0] # xmm3 = mem[0],zero,zero,zero
        vucomiss        xmm0, xmm3
        setae   cl
        or      cl, al
        vmovshdup       xmm0, xmm0              # xmm0 = xmm0[1,1,3,3]
        vucomiss        xmm0, xmm2
        setb    dl
        vucomiss        xmm0, xmm3
        setae   al
        or      al, dl
        or      al, cl
        vxorps  xmm0, xmm0, xmm0
        vcmpltps        xmm0, xmm1, xmm0
        vpermilps       xmm0, xmm0, 212         # xmm0 = xmm0[0,1,1,3]
        vmovaps xmm2, xmmword ptr [rip + .LCPI7_1] # xmm2 = <2.56E+2,2.56E+2,u,u>
        vcmpleps        xmm1, xmm2, xmm1
        vpermilps       xmm1, xmm1, 212         # xmm1 = xmm1[0,1,1,3]
        vorps   xmm0, xmm0, xmm1
        vpsllq  xmm0, xmm0, 63
        vmovmskpd       ecx, xmm0
        or      al, cl
        shr     cl
        or      al, cl
        and     al, 1
        ret

Then I tried the following optimized version, which uses the idea, that negative integer values fill the top bits in the register. So I simply convert the floats into integers and test the high bits if any of them are non-zero. If there are non-zero bits, the vector's component is either negative or above the legal maximum range.

template<typename T, unsigned long N>
static inline std::array<int32_t, N> To_Int_Vec(const std::array<T, N>& x)
{
    std::array<int32_t, N> int_vec;

    for (int i = 0; i < N; ++i)
        int_vec[i] = floor(x[i]);

    return int_vec;
}


bool Check_If_Outside(std::array<float, 4> vec)
{
    constexpr int32_t neg_mask = ~255;

    auto vec_int = To_Int_Vec(vec);

    bool outside = false;

    for (int i = 0; i < 4; i++)
        if (vec_int[i] & neg_mask)
            outside = true;

    return outside;
}

Which gave the following assembler output:

Check_If_Outside(std::array<float, 4ul>):    # @Check_If_Outside(std::array<float, 4ul>)
        vshufps xmm0, xmm0, xmm1, 65            # xmm0 = xmm0[1,0],xmm1[0,1]
        vroundps        xmm0, xmm0, 9
        vcvttps2dq      xmm0, xmm0
        vpshufd xmm1, xmm0, 78                  # xmm1 = xmm0[2,3,0,1]
        vpor    xmm0, xmm0, xmm1
        vpshufd xmm1, xmm0, 229                 # xmm1 = xmm0[1,1,2,3]
        vpor    xmm0, xmm0, xmm1
        vmovd   eax, xmm0
        cmp     eax, 255
        seta    al
        ret

I would assume that this kind of test should still be possible to optimize much further using Intel SIMD instructions, but I'm not sure how to do it. Is it possible to handle it any better in pure C++ or are intrinsics required, or even inline assembler? Or is it even possible to optimize it any further?

To see the optimized output, I'm using x86-64 clang 11.0.0, with compiler flags: -O3 -mtune=skylake -ffast-math -funsafe-math-optimizations -fno-math-errno -msse4.1 -mavx -mfma4

EDIT: Issue solved with your help! Thank you!


Solution

  • Using SIMD directly even seems simpler than the original code:

    bool Check_If_Outside(std::array<float, 4> vec)
    {
        __m128 v = _mm_loadu_ps(vec.data());
        __m128 tooHigh = _mm_cmpge_ps(v, _mm_set1_ps(256));
        return _mm_movemask_ps(_mm_or_ps(v, tooHigh));
    }
    

    Resulting code (at least the way I compiled it):

    Check_If_Outside(std::array<float, 4ul>):    # @Check_If_Outside(std::array<float, 4ul>)
            vmovlhps        xmm0, xmm0, xmm1                # xmm0 = xmm0[0],xmm1[0]
            vbroadcastss    xmm1, dword ptr [rip + .LCPI1_0] # xmm1 = [2.56E+2,2.56E+2,2.56E+2,2.56E+2]
            vcmpleps        xmm1, xmm1, xmm0
            vorps   xmm0, xmm0, xmm1
            vmovmskps       eax, xmm0
            test    eax, eax
            setne   al
            ret
    

    So we skip the conversion to integers, and instead of doing a horizontal-OR we rely on vmovmskps.

    Another idea, based on comparing the floats by their raw bit pattern,

    bool Check_If_Outside(std::array<float, 4> vec)
    {
        __m128i v = _mm_loadu_si128((__m128i*)vec.data());
        __m128i bits256 = _mm_set1_epi32(0x43800000);
        __m128i cmp = _mm_cmpeq_epi32(_mm_min_epu32(v, bits256), bits256);
        return _mm_movemask_ps(_mm_castsi128_ps(cmp));
    }
    

    Either way I am blatantly ignoring the existence of negative zero, and of NaN but -ffast-math already ignores NaN anyway.

    uiCA results for Skylake, if I pretend that the input comes from memory (instead of shuffling them together from registers - to break any accidental dependency):

    • Original: 3.39
    • Version 1, floating point compare: 2.0
    • Version 2, integer compare: 2.0
    • this from the comments: 3.21
    • this from the comments: 3.00 (it gets a bit worse when AVX2 is enabled)

    In all cases this is only some rough indication. The code here is analyzed by itself, not in any real context, but throughput is context-dependent (it affects eg execution port pressure). Also, the setcc should disappear in most real usages (where this function is inlined into the caller and the condition is immediately branched on). And shuffling the function arguments together into one vector (or loading the input from memory, which I used to do the uiCA analysis) would be replaced by whatever actually produces the input.