Fused multiply add and default rounding modes

With GCC 5.3 the following code compiled with -O3 -mfma

float mul_add(float a, float b, float c) {
  return a*b + c;
}

produces the following assembly

vfmadd132ss     %xmm1, %xmm2, %xmm0

I noticed GCC doing this with -O3 already in GCC 4.8.

Clang 3.7 with -O3 -mfma produces

vmulss  %xmm1, %xmm0, %xmm0
vaddss  %xmm2, %xmm0, %xmm0

but Clang 3.7 with -Ofast -mfma produces the same code as GCC with -O3 -mfma.

I am surprised that GCC fuses at -O3, because this answer says

The compiler is not allowed to fuse a separated add and multiply unless you allow for a relaxed floating-point model.

This is because an FMA has only one rounding, while an ADD + MUL has two. So the compiler will violate strict IEEE floating-point behaviour by fusing.

However, this link says

Regardless of the value of FLT_EVAL_METHOD, any floating-point expression may be contracted, that is, calculated as if all intermediate results have infinite range and precision.

So now I am confused and concerned.

  1. Is GCC justified in using FMA with -O3?
  2. Does fusing violate strict IEEE floating-point behaviour?
  3. If fusing does violate IEEE floating-point behaviour, isn't that a contradiction, given that GCC defines __STDC_IEC_559__?

Since FMA can be emulated in software, it seems to me there should be two compiler switches for FMA: one to tell the compiler to use FMA in calculations and one to tell the compiler that the hardware has FMA.

Apparently this can be controlled with the option -ffp-contract. With GCC the default is -ffp-contract=fast and with Clang it's not. Other options such as -ffp-contract=on and -ffp-contract=off do not produce the FMA instruction.

For example Clang 3.7 with -O3 -mfma -ffp-contract=fast produces vfmadd132ss.

I checked some permutations of #pragma STDC FP_CONTRACT set to ON and OFF with -ffp-contract set to on, off, and fast. In all cases I also used -O3 -mfma.

With GCC the answer is simple. #pragma STDC FP_CONTRACT ON or OFF makes no difference. Only -ffp-contract matters.

GCC uses fma with

  1. -ffp-contract=fast (default).

Clang uses fma

  1. with -ffp-contract=fast.
  2. with -ffp-contract=on (default) and #pragma STDC FP_CONTRACT ON (default is OFF).

In other words, with Clang you can get fma with #pragma STDC FP_CONTRACT ON (since -ffp-contract=on is the default) or with -ffp-contract=fast. -ffast-math (and hence -Ofast) sets -ffp-contract=fast.

I looked into MSVC and ICC.

MSVC uses the fma instruction with /O2 /arch:AVX2 /fp:fast. The MSVC default is /fp:precise.

ICC uses fma with -O3 -march=core-avx2 (actually -O1 is sufficient). This is because by default ICC uses -fp-model fast. But ICC uses fma even with -fp-model precise. To disable fma with ICC use -fp-model strict or -no-fma.

So by default GCC and ICC use fma when fma is enabled (with -mfma for GCC/Clang or -march=core-avx2 with ICC) but Clang and MSVC do not.


  • It doesn't violate IEEE-754, because IEEE-754 defers to languages on this point:

    A language standard should also define, and require implementations to provide, attributes that allow and disallow value-changing optimizations, separately or collectively, for a block. These optimizations might include, but are not limited to:


    ― Synthesis of a fusedMultiplyAdd operation from a multiplication and an addition.

    In standard C, the STDC FP_CONTRACT pragma provides the means to control this value-changing optimization. So GCC is licensed to perform the fusion by default, so long as it allows you to disable the optimization by setting STDC FP_CONTRACT OFF. Not supporting that means not adhering to the C standard.