I have the following code:
#include <iostream>
#include <numeric>

int main() {
    volatile float
        a0[4] = {1, 2, 3, 4},
        a1[4] = {4, 5, 6, 7};
    std::cout << std::inner_product(a0, a0 + 4, a1, 0.0F) << std::endl;
    return 0;
}
When I compile the code with -O3 -msse2 with GCC or Clang, I can find no evidence of vectorization in the output. There are individual mulss instructions, suggesting that the multiplications are being performed one at a time, and there are jne and je instructions, which there should not be if vectorization is happening.

Regarding volatile: the arrays were declared volatile to prevent the compiler from simply evaluating inner_product at compile time. I figured this might also prevent the desired optimization, so I tried a version without volatile, but the result is the same: no evidence of vector multiplication instructions.

If GCC and Clang are able to auto-vectorize std::inner_product (which might not be the case, in which case the answer is just "You cannot"), what are the necessary/correct compiler flags to do so? Are there any vectorization-friendly adjustments to my code (preferably portable ones) that are necessary, such as ensuring that the data is aligned to the size of a SIMD register, as a guess?
Using comments on the question and further experimentation, I was able to find the answer.
The reason is that vectorizing some float operations, including those used in inner_product, requires -ffast-math to be enabled: vectorization reorders the operations, which can introduce rounding differences that performing them in the specified order, one at a time, would not introduce.
That the problem lies with float specifically can be shown by using int instead:
#include <iostream>
#include <numeric>

int main() {
    int a0[4], a1[4];
    std::cin >> a0[0] >> a0[1] >> a0[2] >> a0[3] >> a1[0] >> a1[1] >> a1[2] >> a1[3];
    std::cout << std::inner_product(a0, a0 + 4, a1, 0) << std::endl;
    return 0;
}
With just -O3 or -Os, this produces assembly that at least partially vectorizes the operation:
pshufd xmm2, xmm0, 245
pmuludq xmm0, xmm1
pshufd xmm0, xmm0, 232
pshufd xmm1, xmm1, 245
pmuludq xmm1, xmm2
pshufd xmm1, xmm1, 232
punpckldq xmm0, xmm1
pshufd xmm1, xmm0, 238
paddd xmm1, xmm0
pshufd xmm0, xmm1, 85
paddd xmm0, xmm1
To get the float version to vectorize, I can use -ffast-math (Demo) to tell the compiler to optimize the code regardless of the possibility of rounding differences:
movaps xmm0, xmmword ptr [rsp]
mulps xmm0, xmmword ptr [rsp + 16]
movaps xmm1, xmm0
unpckhpd xmm1, xmm0
addps xmm1, xmm0
movaps xmm0, xmm1
shufps xmm0, xmm1, 85
addss xmm0, xmm1