Tags: c++, assembly, gcc, x86, floating-point-comparison

What causes the different NaN behavior when compiling `_mm_ucomilt_ss` intrinsic?


Can someone explain to me why the following code fails for NaNs when compiled with GCC 8.5?

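#include <cmath>        // std::abs
#include <limits>       // std::numeric_limits
#include <xmmintrin.h>  // _mm_set_ss, _mm_ucomilt_ss

bool isfinite_sse42(float num)
{
    return _mm_ucomilt_ss(_mm_set_ss(std::abs(num)),
                          _mm_set_ss(std::numeric_limits<float>::infinity())) == 1;
}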

I expected the function to return false for NaN, but with GCC 8.5 it returns true.

The Intel Intrinsics guide for _mm_ucomilt_ss says

RETURN ( a[31:0] != NaN AND b[31:0] != NaN AND a[31:0] < b[31:0] ) ? 1 : 0

i.e., if either a or b is NaN, it returns 0. At the assembly level (Godbolt), GCC 8.5 emits ucomiss abs(x), Infinity followed by setb.

# GCC8.5 -O2  doesn't match documented intrinsic behaviour for NaN
        ucomiss xmm0, DWORD PTR .LC2[rip]
        setb    al

Interestingly, newer GCC versions and Clang swap the comparison from a < b to b > a and therefore use seta. But why does the code with setb return true for NaN, while the code with seta returns false?


Solution

  • GCC before GCC 13 is buggy: it does not implement the documented semantics of the intrinsic for the NaN case. Getting them right requires either checking PF separately, or doing the compare as ucomiss Inf, abs so that the unordered case (ZF = PF = CF = 1) fails the seta condition (CF = 0 and ZF = 0), just as abs >= Inf does.

    See https://www.felixcloutier.com/x86/ucomiss#operation or the nicer table in https://www.felixcloutier.com/x86/fcomi:fcomip:fucomi:fucomip . (All x86 scalar FP compares that set EFLAGS do it the same way, matching historical fcom / fstsw / sahf.)

    Comparison result   ZF  PF  CF
    left > right        0   0   0
    left < right        0   0   1
    left = right        1   0   0
    Unordered           1   1   1

    Notice that CF is set for both the left < right and unordered cases, but not for the other two cases.
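To make the table concrete, here is a small portable model of it. This is only a sketch: the `Flags` struct and the helper names `ucomiss_flags`, `setb`, `seta` and `setnp` are hypothetical, made up for illustration, and simply mirror the flag values in the table and the conditions the corresponding setcc instructions test.

```cpp
#include <cmath>
#include <limits>

// Hypothetical model of the three flags written by ucomiss / comiss.
struct Flags {
    bool zf, pf, cf;
};

// Mirrors the table: `left` is the first operand, `right` the second.
inline Flags ucomiss_flags(float left, float right) {
    if (std::isnan(left) || std::isnan(right)) return {true, true, true};  // unordered
    if (left > right) return {false, false, false};
    if (left < right) return {false, false, true};
    return {true, false, false};                                           // equal
}

// Conditions tested by the setcc instructions discussed here.
inline bool setb(Flags f)  { return f.cf; }            // "below": CF = 1
inline bool seta(Flags f)  { return !f.cf && !f.zf; }  // "above": CF = 0 and ZF = 0
inline bool setnp(Flags f) { return !f.pf; }           // "no parity": PF = 0
```

In this model, `setb(ucomiss_flags(NaN, Inf))` is true, which is the GCC 8.5 result from the question, while `seta(ucomiss_flags(Inf, NaN))` is false. Combining `setb` with `setnp` also gives false for NaN, which corresponds to checking PF separately.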

    If you can arrange things such that you can check for > or >=, you don't need to setnp cl / and al, cl to rule out Unordered. This is what clang 16 and GCC 13 do to get correct results from ucomiss inf, abs / seta.
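One way to apply this at the source level is to swap the operands yourself and use the greater-than intrinsic. The following is a sketch only: the function name `isfinite_swapped` is hypothetical, and it rests on the assumption that compilers which mishandle `_mm_ucomilt_ss` still compile `_mm_ucomigt_ss` to a plain ucomiss / seta, which is already false for the unordered case. Check the generated assembly for your compiler version before relying on it.

```cpp
#include <cmath>
#include <limits>
#include <xmmintrin.h>  // SSE: _mm_set_ss, _mm_ucomigt_ss (x86 only)

// Hypothetical rewrite of the question's function: test Inf > |num| instead
// of |num| < Inf, so that a plain "above" condition is false for NaN.
inline bool isfinite_swapped(float num) {
    const __m128 inf = _mm_set_ss(std::numeric_limits<float>::infinity());
    return _mm_ucomigt_ss(inf, _mm_set_ss(std::abs(num))) == 1;
}
```

By the documented semantics of `_mm_ucomigt_ss`, this returns false for NaN and for either infinity, and true for every finite input.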

    GCC 8.5 does the right thing if you write abs(x) < infinity; it's only the scalar intrinsic that it doesn't implement properly. (With plain scalar code, it uses comiss instead of ucomiss. The only difference is that comiss raises the #I (invalid) FP exception on QNaN as well as SNaN, which updates the FP environment.)
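For reference, the plain scalar version might look like the sketch below. The name `isfinite_scalar` is hypothetical. It relies on the language rule that an ordered comparison involving NaN is false, so it assumes the code is not built with `-ffast-math` or `-ffinite-math-only`, which allow the compiler to assume NaN and infinity never occur.

```cpp
#include <cmath>
#include <limits>

// Plain scalar comparison: `<` is false whenever either operand is NaN,
// so NaN and both infinities are reported as not finite.
inline bool isfinite_scalar(float num) {
    return std::abs(num) < std::numeric_limits<float>::infinity();
}
```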

    Comparing with the operands swapped needs the Inf constant in a register, so it requires a separate movss load instead of a memory source operand. But plain scalar code does let GCC avoid the useless SSE4.1 insertps instruction that zeros the high 3 elements of XMM0, which ucomiss doesn't read anyway. Clang sees that and optimizes away that part of _mm_set_ss(num), but GCC doesn't. The lack of an efficient way to go from a scalar float to a __m128 with don't-care upper elements is a persistent problem in Intel's intrinsics API that only some compilers manage to optimize around. (See: How to merge a scalar into a vector without the compiler wasting an instruction zeroing upper elements? Design limitation in Intel's intrinsics?) In a register, a float is just the low element of a __m128.