MSVC has supported AVX/AVX2 instructions for years now, and according to this MSDN blog post, it can automatically generate fused multiply-add (FMA) instructions.
Yet neither of the following functions compiles to an FMA instruction:
#include <cmath>

float func1(float x, float y, float z)
{
    return x * y + z;
}

float func2(float x, float y, float z)
{
    return std::fma(x, y, z);
}
Even worse, std::fma is not implemented as a single FMA instruction and performs terribly, much slower than a plain x * y + z
(the poor performance of std::fma is expected if the implementation doesn't rely on the FMA instruction).
I compile with the /arch:AVX2 /O2 /Qvec flags.
I also tried /fp:fast, without success.
So the question is: how can MSVC be forced to automatically emit FMA instructions?
UPDATE
There is a #pragma fp_contract (on|off), which (as far as I can tell) does nothing.
I solved this long-standing problem.
As it turns out, the flags /fp:fast, /arch:AVX2 and /O1 (or higher) are not enough for Visual Studio 2015 to emit FMA instructions in 32-bit mode. You also need "Whole Program Optimization" turned on with the /GL flag.
Then Visual Studio 2015 will generate an FMA instruction, vfmadd213ss, for
float func1(float x, float y, float z)
{
    return x * y + z;
}
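If you would rather not read the disassembly, a rough run-time check is possible, because a fused multiply-add rounds once while a separate multiply and add rounds twice. The sketch below is only a heuristic under my own assumptions: the helper names unfused and func1_looks_fused are hypothetical, the inputs are chosen so that the two results differ, and the outcome depends on how the compiler treats func1 in the build being tested.

```cpp
#include <cmath>

// Same function as above.
float func1(float x, float y, float z)
{
    return x * y + z;
}

// Hypothetical reference for the unfused result: the volatile store forces
// the product to be rounded to float before the addition takes place.
float unfused(float x, float y, float z)
{
    volatile float product = x * y;
    return product + z;
}

// Hypothetical check: true if func1 returns the single-rounding result.
bool func1_looks_fused()
{
    // volatile keeps the compiler from folding the whole test at compile time.
    volatile float x = 1.0f + std::ldexp(1.0f, -23);    // 1 + 2^-23
    volatile float z = -(1.0f + std::ldexp(1.0f, -22)); // -(1 + 2^-22)

    // Exact product is 1 + 2^-22 + 2^-46. A fused operation keeps the 2^-46
    // term, whereas rounding the product to float first discards it.
    return func1(x, x, z) == std::fma(x, x, z);
}
```

With these inputs the fused result is 2^-46 and the unfused result is 0, so func1_looks_fused() returning true suggests the multiply and add were contracted. Checking the generated assembly for vfmadd213ss remains the more reliable test.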
Regarding std::fma, I opened a bug at Microsoft Connect. They confirmed that std::fma doesn't compile to FMA instructions because the compiler doesn't treat it as an intrinsic. According to their response, this will be fixed in a future update to get the best possible codegen.
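Until std::fma is treated as an intrinsic, one possible workaround is to call the FMA intrinsic from <immintrin.h> directly. This is a minimal sketch, not something Microsoft suggested in their response: the wrapper name fused_multiply_add is hypothetical, and the preprocessor test is my assumption about how to detect that FMA code generation is enabled (__FMA__ on GCC/Clang, __AVX2__ on MSVC with /arch:AVX2). The resulting binary will only run on CPUs that support FMA.

```cpp
#include <cmath>

#if defined(__FMA__) || (defined(_MSC_VER) && !defined(__clang__) && defined(__AVX2__))
#define HAVE_FMA_INTRINSIC 1
#include <immintrin.h>
#else
#define HAVE_FMA_INTRINSIC 0
#endif

// Hypothetical wrapper: computes x * y + z with a single rounding.
inline float fused_multiply_add(float x, float y, float z)
{
#if HAVE_FMA_INTRINSIC
    // Scalar FMA on the lowest lane of three SSE registers.
    return _mm_cvtss_f32(
        _mm_fmadd_ss(_mm_set_ss(x), _mm_set_ss(y), _mm_set_ss(z)));
#else
    // Fallback when FMA code generation is not enabled at compile time.
    return std::fma(x, y, z);
#endif
}
```

Both branches return the same correctly rounded value, so callers do not need to know which one was compiled in. Only the speed differs.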