When compiling for a processor that supports the AVX extension (say -m64 -march=corei7-avx -mtune=corei7-avx is applicable), does it make sense to use -mfpmath=both and -mavx at the same time? Isn't that too much, in that it causes the compiler to use three sets of instructions (x87, SSE, AVX) at the same time? Or does it just use x87 for scalars (in some sense) and AVX for vectors only?
You normally don't want this; I don't think gcc is smart enough about deciding when to use x87 under high register pressure for it to be worthwhile.
x87 and SSE/AVX instructions compete for the same FP execution units on normal x86 CPUs (Intel and AMD), so you don't get more throughput from interleaving them.
Normally you should just use -mfpmath=sse (which means AVX when used with -mavx), or better, -mfpmath=sse -march=native. The default for x86-64 is sse, so -mfpmath=sse only changes anything for -m32, AFAIK.
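One way to check which FP math mode a given set of flags selects is to test gcc's predefined macros. The sketch below assumes gcc (or a compatible compiler) targeting x86, where __SSE2_MATH__ is defined when double-precision scalar math goes through SSE registers. The function name sse_math_enabled is made up for illustration. As far as I know the macro is also defined under -mfpmath=both, so it only separates pure x87 math from the other modes.

```c
/* Returns 1 if the compiler does double-precision scalar math in SSE
   registers (gcc defines __SSE2_MATH__ in that case), 0 otherwise.
   sse_math_enabled is a hypothetical helper name, not a gcc API. */
int sse_math_enabled(void)
{
#ifdef __SSE2_MATH__
    return 1;
#else
    return 0;
#endif
}
```

Calling this from a small main and printing the result is expected to give 1 for a plain -m64 build, and 0 for a -m32 build that still defaults to x87 math, if the defaults are as described above.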
The main benefit of -mfpmath=both is more total registers, but managing the x87 register stack often costs extra instructions. Moving data between x87 and AVX also costs a store/reload (store-forwarding round trip, ~6 cycle latency on Haswell, http://agner.org/optimize/), so it's only really useful if you have two independent sets of calculations for the compiler to interleave. Otherwise it's not better than normal spill/reload.
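To see what "two independent sets of calculations" looks like in practice, here is a minimal sketch you can compile both ways and compare the asm. The function name two_chains and the file name in the comment are made up for illustration. It is deliberately tiny: with only two accumulators there is no real register pressure, so gcc may well keep everything in xmm/ymm registers (or auto-vectorize) even with -mfpmath=both. Real code would need many more live values before the extra x87 registers could matter.

```c
#include <stddef.h>

/* Compare the generated asm for the two modes, e.g.:
 *   gcc -O3 -march=corei7-avx -mfpmath=sse  -S two_chains.c
 *   gcc -O3 -march=corei7-avx -mfpmath=both -S two_chains.c
 * (two_chains.c is a hypothetical file name.)
 */

/* Two independent dependency chains over the same input:
   sum_sq never feeds sum_ab and vice versa until the final subtract,
   so in principle the compiler could keep them in different
   register files. Whether it does so is up to gcc. */
double two_chains(const double *a, const double *b, size_t n)
{
    double sum_sq = 0.0;  /* chain 1: sum of a[i]^2 */
    double sum_ab = 0.0;  /* chain 2: sum of a[i]*b[i] */

    for (size_t i = 0; i < n; i++) {
        sum_sq += a[i] * a[i];
        sum_ab += a[i] * b[i];
    }

    /* The only point where the two chains meet. If they lived in
       different register files, this is where a store/reload would
       be needed to bring them together. */
    return sum_sq - sum_ab;
}
```

In the -mfpmath=both output, look for fld/fstp instructions paired with vmovsd stores and loads to the same stack slot; those are the x87/xmm bounces described above.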
Last time I looked at gcc -O3 -mfpmath=both, the results were not impressive: https://godbolt.org/g/p2KLEC shows gcc 5.4 using some store/reloads to bounce data between x87 and xmm (AVX) registers. It would be better off just keeping some of the constants in memory.