optimization floating-point avx sse2 intel-ipp

Flush-to-zero denormals - is it reliable?

For signal processing this has been an issue forever, and right now I'm still taking the precaution of adding a small constant wherever a denormal can occur, e.g.:

float coef = 0.9f;
for (int i = 0; i < cnt; i++) dst[i] = state = state * coef + 1E-15f;

This is obviously far from ideal, but in the past I had numerous problems where, even though I tried to set FTZ, it apparently didn't take effect on some computers. Currently I'm using Intel IPP like this:

ippSetDenormAreZeros(b);
const int success = ippSetFlushToZero(b, NULL) == ippStsNoErr;

So how reliable is this? Is there a better, more reliable way? Unfortunately I need to support fairly old CPUs, from roughly Core 2 Duo onward, on both Windows and OS X. However, I'm generally using SSE2 and newer, and Clang with -mfpmath=sse and -ffast-math.

Solution

You say "generally", but x87 math doesn't have any equivalent of the FTZ/DAZ bits in the MXCSR. Only SSE/AVX math has those. So if you ever compile legacy 32-bit code using x87 math, you might still get slowdowns from subnormals because the hardware has no means of disabling gradual underflow for them. (And x87 is also slow on NaN / Inf where SSE isn't.)
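If you want to be sure a given translation unit really is built with SSE math rather than x87, you can check the compiler's predefined macros. This is only a sketch: `built_with_sse_math` is a name I made up, `__SSE2_MATH__` is what GCC and Clang define when scalar FP goes through SSE2, and `_M_X64` / `_M_IX86_FP` are the MSVC-style equivalents.

```c
/* Hypothetical helper: returns 1 when scalar float math in this translation
 * unit is compiled to SSE2 instructions, 0 when it may fall back to x87. */
static int built_with_sse_math(void)
{
#if defined(__SSE2_MATH__)
    /* GCC/Clang: SSE2 scalar math (the default on x86-64, or -mfpmath=sse -msse2). */
    return 1;
#elif defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
    /* MSVC-style macros: x64, or 32-bit with /arch:SSE2 or higher. */
    return 1;
#else
    return 0;
#endif
}
```

A 32-bit build where this returns 0 is the case where FTZ/DAZ can't help you.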

In general, linking with -ffast-math makes the compiler link in CRT startup code that sets those MXCSR bits (GCC does this via crtfastmath.o), but that only covers the main program's startup. In shared-library functions you can't assume the bits are set unless you set them yourself. (And remember the MXCSR is per-thread. I'm not sure whether new threads inherit it from the parent or start with the default settings.)
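Setting the bits yourself doesn't need IPP; the MXCSR can be read and written directly with the `_mm_getcsr` / `_mm_setcsr` intrinsics. A minimal sketch, assuming SSE is available and the CPU supports DAZ (which as far as I know holds for Core 2 and later); `enable_ftz_daz` and the two `MXCSR_*` names are mine, not from any library:

```c
#include <xmmintrin.h>  /* _mm_getcsr, _mm_setcsr */

#define MXCSR_DAZ 0x0040u  /* bit 6: treat subnormal inputs as zero */
#define MXCSR_FTZ 0x8000u  /* bit 15: flush subnormal results to zero */

/* Hypothetical helper: call it at the start of every thread that runs
 * FP-heavy code. Returns 1 if both bits read back as set. */
static int enable_ftz_daz(void)
{
    _mm_setcsr(_mm_getcsr() | MXCSR_FTZ | MXCSR_DAZ);
    return (_mm_getcsr() & (MXCSR_FTZ | MXCSR_DAZ)) == (MXCSR_FTZ | MXCSR_DAZ);
}
```

Because the register is per-thread, the safe pattern is to call this on entry to each worker thread (or each audio callback) rather than once at program start.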

As for changing these settings on the fly, and compile-time reordering of independent FP math across the calls that change them: the compiler is allowed to assume that nothing changes the FP rounding mode at run time unless you use #pragma STDC FENV_ACCESS ON in C.

But that pragma only exists in C, not C++, according to "Does FENV_ACCESS pragma exist in C++11 and higher?"

And compilers in practice might not be perfect at this. I don't really know. Hopefully someone else has a more specific answer about this.

But yes, with FTZ and DAZ set, no existing x86 hardware will ever have any FP-assist slowdowns when running SSE/AVX math instructions.
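If you'd rather check on the target machine that FTZ actually takes effect, you can force an underflow and see whether the result is exactly zero. This is a rough sketch, assuming the code is compiled with SSE math; the function names are made up, and `volatile` is only there to stop the compiler from folding the multiply at compile time.

```c
#include <xmmintrin.h>  /* _mm_getcsr, _mm_setcsr */

/* Returns 1 if a product that underflows into the subnormal range
 * comes out as exactly zero. */
static int subnormal_result_is_flushed(void)
{
    volatile float a = 1e-30f;
    volatile float b = 1e-10f;
    volatile float r = a * b;  /* about 1e-40, below FLT_MIN (about 1.18e-38) */
    return r == 0.0f;
}

/* Hypothetical self-test: returns 1 if the result is subnormal with
 * FTZ/DAZ cleared and zero with FTZ set. Restores the MXCSR afterwards. */
static int ftz_takes_effect(void)
{
    unsigned int saved = _mm_getcsr();

    _mm_setcsr(saved & ~0x8040u);               /* clear FTZ (bit 15) and DAZ (bit 6) */
    int flushed_without_ftz = subnormal_result_is_flushed();

    _mm_setcsr((saved & ~0x8040u) | 0x8000u);   /* set FTZ only */
    int flushed_with_ftz = subnormal_result_is_flushed();

    _mm_setcsr(saved);                          /* put the caller's settings back */
    return !flushed_without_ftz && flushed_with_ftz;
}
```

This only tells you the flag works on the calling thread; it says nothing about other threads or about code compiled to x87.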