I am trying to understand whether it is advantageous to use std::fma with double arguments by looking at the assembly code that is generated. I am compiling with the flag "-O3", and I am comparing the assembly for these two routines:
#include <cmath>
#define FP_FAST_FMAF

float test_1(const double &a, const double &b, const double &c) {
    return a * b + c;
}

float test_2(const double &a, const double &b, const double &c) {
    return std::fma(a, b, c);
}
Using Compiler Explorer, this is the assembly generated for the two routines:
test_1(double const&, double const&, double const&):
movsd xmm0, QWORD PTR [rdi] #5.12
mulsd xmm0, QWORD PTR [rsi] #5.14
addsd xmm0, QWORD PTR [rdx] #5.18
cvtsd2ss xmm0, xmm0 #5.18
ret #5.18
test_2(double const&, double const&, double const&):
push rsi #7.65
movsd xmm0, QWORD PTR [rdi] #8.12
movsd xmm1, QWORD PTR [rsi] #8.12
movsd xmm2, QWORD PTR [rdx] #8.12
call fma #8.12
cvtsd2ss xmm0, xmm0 #8.12
pop rcx #8.12
ret
The assembly does not change when using the latest available version of either icc or gcc. What puzzles me regarding the performance of the two routines is that, while there is only one memory operation (movsd) for test_1, there are three for test_2; considering that the latency of memory operations is between one and two orders of magnitude larger than the latency of floating-point operations, test_1 should be faster. So, in which situations is it advisable to use std::fma? What is mistaken in my hypothesis?
If your question is only about the number of memory operations, it is important to note that the mulsd and addsd instructions in your example are also memory operations. A memory operand is indicated by the square brackets around the register name, not by the assembly mnemonic itself: mulsd xmm0, QWORD PTR [rsi] both loads a value from memory and multiplies by it.
If you're still curious whether it's advantageous to use std::fma, the answer is probably "it depends."
When you are analyzing performance by looking at assembly, it is almost essential to give the compiler at least some information about your target architecture. std::fma uses hardware FMA instructions if they are available on the target architecture, so whether or not std::fma improves performance in general is not really an answerable question.
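One portable hint, for what it's worth: <cmath> defines the macro FP_FAST_FMA when std::fma on double is expected to be at least as fast as a separate multiply and add. It is meant to be defined by the implementation, not by you, so the #define FP_FAST_FMAF in your question does not influence code generation. A minimal sketch of checking it:

#include <cmath>
#include <cstdio>

int main() {
    // FP_FAST_FMA is defined by <cmath> (not by the user) when std::fma
    // on double should be as fast as a separate multiply and add.
#ifdef FP_FAST_FMA
    std::puts("std::fma(double) should be fast on this target");
#else
    std::puts("std::fma(double) may fall back to a library call");
#endif
    return 0;
}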
If you specify -mfma in Compiler Explorer, the compiler has some information that it can leverage to generate more efficient code. You can also specify -march=[your architecture], which will automatically set -mfma for you if it is supported.
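For example, compiling test_2 with gcc at -O3 -mfma collapses the library call into a single hardware instruction. The output looks roughly like this (a sketch; exact instruction selection and register allocation vary between compiler versions):

test_2(double const&, double const&, double const&):
vmovsd xmm0, QWORD PTR [rdi] # a
vmovsd xmm1, QWORD PTR [rdx] # c
vfmadd132sd xmm0, xmm1, QWORD PTR [rsi] # xmm0 = a*b + c, rounded once
vcvtsd2ss xmm0, xmm0, xmm0 # narrow to float for the return value
ret

Note that the push/pop and the call to fma are gone, and the memory-operand count matches the pattern of test_1.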
Additionally, there's a whole other can of worms about the slight differences between the results of std::fma and (a*b)+c due to the way rounding is handled with floating-point numbers. std::fma rounds only once across the two floating-point operations, while (a*b)+c might[1] compute a*b, round the result to 64 bits, add c to that value, and then round the result to 64 bits again.
If you want to minimize floating-point arithmetic error in your calculations, std::fma is probably the better choice because it guarantees you will only have precious bits stripped away from your floating-point numbers once.
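Here is a small sketch (my own illustration, not part of the question) that makes the single rounding visible: because std::fma computes a*b exactly before adding, it can recover the rounding error of a product, which plain arithmetic cannot:

#include <cmath>
#include <cstdio>

int main() {
    // x = 1 + 2^-30, so x*x = 1 + 2^-29 + 2^-60. A double cannot hold the
    // 2^-60 term next to 1, so it is rounded away when x*x is stored.
    double x = 1.0 + std::ldexp(1.0, -30);
    double prod = x * x;                  // rounded product: 1 + 2^-29
    double naive = x * x - prod;          // 0 -- unless the compiler contracts
                                          // this into an FMA itself, see [1]
    double exact = std::fma(x, x, -prod); // exact x*x minus rounded x*x
    std::printf("naive: %g, fma: %g\n", naive, exact); // fma: 8.67362e-19 (= 2^-60)
    return 0;
}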
[1] Whether or not this extra rounding happens depends on your compiler, your optimization settings, and your architecture settings: Compiler Explorer examples for msvc, gcc, icc, clang