I have the following code:
#include <iostream>
#include <numeric>

int main() {
    volatile float
        a0[4] = {1, 2, 3, 4},
        a1[4] = {4, 5, 6, 7};
    std::cout << std::inner_product(a0, a0 + 4, a1, 0.0F) << std::endl;
    return 0;
}
When I compile the code with -O3 -msse2 with GCC or Clang, I can find no evidence of vectorization in the output. There are individual mulss instructions, suggesting that the multiplications are being performed one at a time, and there are jne and je instructions, which there should not be if vectorization is happening.

Regarding volatile: the arrays were declared volatile to prevent the compiler from simply evaluating inner_product at compile time. I figured this might also prevent the desired optimization, so I tried a version without volatile, but the result is the same: no evidence of vector multiplication instructions.

If GCC and Clang are able to auto-vectorize std::inner_product (which might not be the case, in which case the answer is just "You cannot"), what are the necessary/correct compiler flags to do so? Are there any vectorization-friendly adjustments to my code (preferably portable ones) that are necessary, such as ensuring that the data is aligned to the size of a SIMD register, as a guess?
Using comments on the question and further experimentation, I was able to find the answer.
The reason is that vectorizing some float operations, including those used in inner_product, requires -ffast-math to be enabled: vectorization reorders the operations, which can introduce rounding differences that performing them in the specified order, one at a time, would not introduce.
That the problem lies with float specifically can be shown by using int instead:
#include <iostream>
#include <numeric>

int main() {
    int a0[4], a1[4];
    std::cin >> a0[0] >> a0[1] >> a0[2] >> a0[3] >> a1[0] >> a1[1] >> a1[2] >> a1[3];
    std::cout << std::inner_product(a0, a0 + 4, a1, 0) << std::endl;
    return 0;
}
With just -O3 or -Os, this produces assembly that at least partially vectorizes the operation:
pshufd xmm2, xmm0, 245
pmuludq xmm0, xmm1
pshufd xmm0, xmm0, 232
pshufd xmm1, xmm1, 245
pmuludq xmm1, xmm2
pshufd xmm1, xmm1, 232
punpckldq xmm0, xmm1
pshufd xmm1, xmm0, 238
paddd xmm1, xmm0
pshufd xmm0, xmm1, 85
paddd xmm0, xmm1
To get the float version to vectorize, I can use -ffast-math (Demo) to tell the compiler to optimize the code regardless of the possibility of rounding differences:
movaps xmm0, xmmword ptr [rsp]
mulps xmm0, xmmword ptr [rsp + 16]
movaps xmm1, xmm0
unpckhpd xmm1, xmm0
addps xmm1, xmm0
movaps xmm0, xmm1
shufps xmm0, xmm1, 85
addss xmm0, xmm1