I am trying to understand whether it is advantageous to use std::fma with double arguments by looking at the assembly code that is generated. I am compiling with the flag "-O3", and I am comparing the assembly for these two routines:
#include <cmath>
#define FP_FAST_FMAF

float test_1(const double &a, const double &b, const double &c) {
    return a * b + c;
}

float test_2(const double &a, const double &b, const double &c) {
    return std::fma(a, b, c);
}
Using Compiler Explorer, this is the assembly generated for the two routines:
test_1(double const&, double const&, double const&):
movsd xmm0, QWORD PTR [rdi] #5.12
mulsd xmm0, QWORD PTR [rsi] #5.14
addsd xmm0, QWORD PTR [rdx] #5.18
cvtsd2ss xmm0, xmm0 #5.18
ret #5.18
test_2(double const&, double const&, double const&):
push rsi #7.65
movsd xmm0, QWORD PTR [rdi] #8.12
movsd xmm1, QWORD PTR [rsi] #8.12
movsd xmm2, QWORD PTR [rdx] #8.12
call fma #8.12
cvtsd2ss xmm0, xmm0 #8.12
pop rcx #8.12
ret
The assembly does not change when using the latest available version of either icc or gcc. What puzzles me regarding the performance of the two routines is that, while there is only one memory operation (movsd) for test_1, there are three for test_2; considering that the latency of memory operations is between one and two orders of magnitude larger than the latency of floating-point operations, test_1 should be faster. So, in which situations is it advisable to use std::fma? What is mistaken in my hypothesis?
If your question is only about the number of memory operations, it is important to note that the mulsd and addsd instructions in your example are also memory operations. A memory operand is indicated by the square brackets around the register name, not by the assembly mnemonic itself: mulsd xmm0, QWORD PTR [rsi] both loads a value from memory and multiplies by it.
If you're still curious whether it's advantageous to use std::fma, the answer is probably "it depends."
When you are analyzing performance by looking at assembly, it is almost essential to give the compiler at least some information about your target architecture. std::fma uses hardware FMA instructions if they are available on the target architecture, so whether or not std::fma improves performance in general is not really an answerable question.
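One portable hint, for what it's worth: <cmath> defines the macro FP_FAST_FMA when std::fma on double is expected to be at least as fast as a separate multiply and add. It is meant to be defined by the implementation, not by you, so the #define FP_FAST_FMAF in your question does not influence code generation. A minimal sketch of checking it:

#include <cmath>
#include <cstdio>

int main() {
    // FP_FAST_FMA is defined by <cmath> (not by the user) when std::fma
    // on double should be as fast as a separate multiply and add.
#ifdef FP_FAST_FMA
    std::puts("std::fma(double) should be fast on this target");
#else
    std::puts("std::fma(double) may fall back to a library call");
#endif
    return 0;
}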
If you specify -mfma in Compiler Explorer, the compiler has some information that it can leverage to generate more efficient code. You can also specify -march=[your architecture], which will automatically set -mfma for you if it is supported.
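For example, compiling test_2 with gcc at -O3 -mfma collapses the library call into a single hardware instruction. The output looks roughly like this (a sketch; exact instruction selection and register allocation vary between compiler versions):

test_2(double const&, double const&, double const&):
vmovsd xmm0, QWORD PTR [rdi] # a
vmovsd xmm1, QWORD PTR [rdx] # c
vfmadd132sd xmm0, xmm1, QWORD PTR [rsi] # xmm0 = a*b + c, rounded once
vcvtsd2ss xmm0, xmm0, xmm0 # narrow to float for the return value
ret

Note that the push/pop and the call to fma are gone, and the memory-operand count matches the pattern of test_1.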
Additionally, there's a whole other can of worms about the slight differences between the results of std::fma and (a*b)+c due to the way rounding is handled with floating-point numbers. std::fma rounds only once across the two floating-point operations, while (a*b)+c might[1] compute a*b, round the result to 64 bits, add c to that value, and then round the result to 64 bits again.
If you want to minimize floating-point arithmetic error in your calculations, std::fma is probably the better choice because it guarantees you will only have precious bits stripped away from your floating-point numbers once.
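Here is a small sketch (my own illustration, not part of the question) that makes the single rounding visible: because std::fma computes a*b exactly before adding, it can recover the rounding error of a product, which plain arithmetic cannot:

#include <cmath>
#include <cstdio>

int main() {
    // x = 1 + 2^-30, so x*x = 1 + 2^-29 + 2^-60. A double cannot hold the
    // 2^-60 term next to 1, so it is rounded away when x*x is stored.
    double x = 1.0 + std::ldexp(1.0, -30);
    double prod = x * x;                  // rounded product: 1 + 2^-29
    double naive = x * x - prod;          // 0 -- unless the compiler contracts
                                          // this into an FMA itself, see [1]
    double exact = std::fma(x, x, -prod); // exact x*x minus rounded x*x
    std::printf("naive: %g, fma: %g\n", naive, exact); // fma: 8.67362e-19 (= 2^-60)
    return 0;
}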
[1] Whether or not this extra rounding happens depends on your compiler, your optimization settings, and your architecture settings: Compiler Explorer examples for msvc, gcc, icc, clang