c++ gcc avx floating-point-exceptions fenv

Why is fetestexcept in C++ compiled to a function call rather than inlined?

I am evaluating the use (clearing and querying) of floating-point exceptions in performance-critical ("hot") code. Looking at the generated binary, I noticed that neither GCC nor Clang expands the call into the inline instruction sequence I would expect; instead, both generate a call into the runtime library. This is prohibitively expensive for my application.

Consider the following minimal example:

#include <fenv.h>
#pragma STDC FENV_ACCESS ON

inline int fetestexcept_inline(int e)
{
  unsigned int mxcsr;
  asm volatile ("vstmxcsr" " %0" : "=m" (*&mxcsr));
  return mxcsr & e & FE_ALL_EXCEPT;
}

double f1(double a)
{
    double r = a * a;
    if(r == 0 || fetestexcept_inline(FE_OVERFLOW)) return -1;
    else return r;
}

double f2(double a)
{
    double r = a * a;
    if(r == 0 || fetestexcept(FE_OVERFLOW)) return -1;
    else return r;
}

And the output as produced by GCC: https://godbolt.org/z/jxjzYY

The compiler seems to know that it can use the CPU-family-dependent AVX instructions for the target (it uses "vmulsd" for the multiplication). However, no matter which optimization flags I try, it always produces the much more expensive function call to glibc rather than the assembly that (as far as I understand) should do what the corresponding glibc function does.

This is not intended as a complaint; I am OK with adding the inline assembly. I just wonder whether there is a subtle difference I am overlooking that could be a bug in the inline-assembly version.

Solution

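The library call is required to support long double arithmetic. On x86, fetestexcept needs to merge the SSE and x87 FPU states, because long double operations only update the x87 FPU status word and not the MXCSR register. An inline version that reads only MXCSR therefore misses exceptions raised by long double operations. Since a complete version has to read both registers, the benefit from inlining is somewhat reduced.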
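To illustrate the difference, here is a minimal sketch of an MXCSR-only test next to one that also reads the x87 status word. It assumes x86-64 with GCC or Clang inline assembly and an 80-bit long double computed on the x87 unit. The names test_sse_only, test_merged and overflow_long_double are hypothetical helpers made up for this example, not glibc functions.

```cpp
#include <cfenv>
#include <cfloat>

// Hypothetical helper: looks at the SSE exception flags (MXCSR) only,
// like the inline version in the question.
static inline int test_sse_only(int e)
{
    unsigned int mxcsr = 0;
    __asm__ volatile ("stmxcsr %0" : "=m" (mxcsr));
    return static_cast<int>(mxcsr) & e & FE_ALL_EXCEPT;
}

// Hypothetical helper: ORs the x87 status word into the MXCSR flags,
// so exceptions raised by long double operations are seen as well.
static inline int test_merged(int e)
{
    unsigned short x87_sw = 0;
    unsigned int mxcsr = 0;
    __asm__ volatile ("fnstsw %0" : "=m" (x87_sw));
    __asm__ volatile ("stmxcsr %0" : "=m" (mxcsr));
    return static_cast<int>(x87_sw | mxcsr) & e & FE_ALL_EXCEPT;
}

// Clears all flags, then overflows a long double multiplication.
// Assumption: long double arithmetic is done on the x87 unit, so only
// the x87 status word records the overflow. The volatile accesses keep
// the compiler from folding the multiplication at compile time.
static void overflow_long_double()
{
    std::feclearexcept(FE_ALL_EXCEPT);
    volatile long double big = LDBL_MAX;
    volatile long double r = big * big;
    (void)r;
}
```

Under these assumptions, after calling overflow_long_double(), test_sse_only(FE_OVERFLOW) should return 0 while test_merged(FE_OVERFLOW) and std::fetestexcept(FE_OVERFLOW) return a nonzero value. If the hot code never touches long double, the MXCSR-only version may be sufficient, but that is a property of the calling code rather than something the compiler can assume for fetestexcept in general.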