The FMA(3) page of the BSD Library Functions Manual says "These functions compute x * y + z."
So what's the difference between fma and naive code that simply computes x * y + z? And why does fma have better performance in most cases?
a*b+c produces a result as if the computation were:

- Multiply a and b, producing an exact intermediate product.
- Round the product to the floating-point format.
- Add c.
- Round the sum to the floating-point format.

fma(a, b, c) produces a result as if the computation were:

- Multiply a and b, producing an exact intermediate product.
- Add c.
- Round the sum to the floating-point format.

So it skips the step of rounding the intermediate product to the floating-point format.
On a processor with an FMA instruction, a fused multiply-add may be faster because it is one floating-point instruction instead of two, and hardware engineers can often design the processor to do it efficiently. On a processor without an FMA instruction, a fused multiply-add may be slower because the software has to use extra instructions to maintain the information necessary to get the required result.