Why does the compiler not use SIMD instructions for a simple loop computing a sum, even when compiling with -O3 and -march=native?
Consider the following two functions:
#include <immintrin.h>
#include <cstddef>
#include <vector>

float sum_simd(const std::vector<float>& vec) {
    __m256 a{0.0};
    // Note: assumes vec.size() is a multiple of 8; a complete version
    // would need a scalar tail loop for the remaining elements.
    for (std::size_t i = 0; i < vec.size(); i += 8) {
        __m256 tmp = _mm256_loadu_ps(&vec[i]);
        a = _mm256_add_ps(tmp, a);
    }
    float res{0.0};
    // Indexing a __m256 like an array is a GCC/Clang extension.
    for (std::size_t i = 0; i < 8; ++i) {
        res += a[i];
    }
    return res;
}

float normal_sum(const std::vector<float>& vec) {
    float sum{0};
    for (std::size_t i = 0; i < vec.size(); ++i) {
        sum += vec[i];
    }
    return sum;
}
The compiler seems to turn the summations into:
vaddps ymm0, ymm0, ymmword ptr [rax + 4*rdx]
and
vaddss xmm0, xmm0, dword ptr [rcx + 4*rsi]
vaddss xmm0, xmm0, dword ptr [rcx + 4*rsi + 4]
vaddss xmm0, xmm0, dword ptr [rcx + 4*rsi + 8]
vaddss xmm0, xmm0, dword ptr [rcx + 4*rsi + 12]
vaddss xmm0, xmm0, dword ptr [rcx + 4*rsi + 16]
vaddss xmm0, xmm0, dword ptr [rcx + 4*rsi + 20]
vaddss xmm0, xmm0, dword ptr [rcx + 4*rsi + 24]
vaddss xmm0, xmm0, dword ptr [rcx + 4*rsi + 28]
When I run this on my machine, I get a substantial speedup (roughly 10x) from the SIMD sum. The same is true on Godbolt. See here for the code.
I compiled the program with GCC 13 and Clang 17 and used the options -O3 -march=native.
Why is the function normal_sum slower and not fully vectorized? Do I need to specify additional compiler options?
Yes. -ffast-math solves this issue (see on Godbolt). Here is the main loop with this additional flag:
.L10:
vaddps ymm1, ymm1, YMMWORD PTR [rax] ; <---------- vectorized
add rax, 32
cmp rcx, rax
jne .L10
However, note that -ffast-math is a combination of several more specific flags, some of which can be quite dangerous. For example, -funsafe-math-optimizations and -ffinite-math-only can break existing code that uses infinities, or reduce its precision. In fact, some algorithms, like Kahan summation, require the compiler not to assume floating-point operations are associative (which -ffast-math does). For more information about this, please read the post What does gcc's ffast-math actually do?.
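To make the Kahan example concrete, here is a minimal sketch (not from the original post): the compensation term c algebraically cancels to zero under associative arithmetic, so a compiler that is allowed to reassociate may delete it and degrade the algorithm back to a naive sum.

#include <vector>

float kahan_sum(const std::vector<float>& vec) {
    float sum = 0.0f;
    float c = 0.0f;            // running compensation for lost low-order bits
    for (float x : vec) {
        float y = x - c;       // apply the correction from the previous step
        float t = sum + y;     // low-order bits of y may be lost here
        c = (t - sum) - y;     // recover exactly what was lost
        sum = t;
    }
    return sum;
}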
The main reason the code is not automatically vectorized without -ffast-math is simply that floating-point addition is not associative (i.e. (a+b)+c != a+(b+c) in general). Because of that, the compiler cannot reorder the long chain of floating-point additions. Note that there is a flag specifically meant to change this behaviour (-fassociative-math), but it is often not sufficient on its own to auto-vectorize the code (as is the case here). One needs to use a combination of flags (a subset of -ffast-math) depending on the target compiler (and possibly its version) to enable the auto-vectorization.
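To see the non-associativity concretely, here is a small self-contained check (my example, compiled without -ffast-math): summing the same three values in a different order gives a different rounded result.

#include <cstdio>

int main() {
    float a = 1e8f, b = -1e8f, c = 1.0f;
    // At a magnitude of 1e8, a float ulp is 8, so -1e8f + 1.0f
    // rounds back to -1e8f and the 1.0 is lost.
    std::printf("(a+b)+c = %g\n", (a + b) + c);  // prints 1
    std::printf("a+(b+c) = %g\n", a + (b + c));  // prints 0
}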
Non-fast-math is also why the horizontal-sum loop in your manually-vectorized version is so inefficient: it shuffles to extract each element one at a time, instead of shuffling to narrow the vector in half for SIMD sums. See Fastest way to do horizontal SSE vector sum (or other reduction).
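For reference, an efficient horizontal sum using that narrowing approach looks roughly like this (a sketch of the standard technique; the function name is mine, not from the linked answer):

#include <immintrin.h>

// Reduce 256 -> 128 -> 64 -> 32 bits with shuffles, doing SIMD additions
// on ever-narrower halves instead of extracting one lane at a time.
static inline float hsum256_ps(__m256 v) {
    __m128 lo  = _mm256_castps256_ps128(v);         // lower 128 bits (free)
    __m128 hi  = _mm256_extractf128_ps(v, 1);       // upper 128 bits
    __m128 sum = _mm_add_ps(lo, hi);                // 4 partial sums
    sum = _mm_add_ps(sum, _mm_movehl_ps(sum, sum)); // 2 partial sums
    sum = _mm_add_ss(sum, _mm_shuffle_ps(sum, sum, 0x1)); // total in lane 0
    return _mm_cvtss_f32(sum);
}

In sum_simd, the final extraction loop could then be replaced by return hsum256_ps(a);.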
A simple architecture-independent way to vectorize the code is to use OpenMP. To do that, you need to add the line #pragma omp simd reduction(+:sum) before the loop, and the compilation flag -fopenmp-simd (or the full -fopenmp, which would also let it try to auto-parallelize). See on Godbolt.
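Applied to normal_sum, this looks as follows (a sketch; only the pragma and the flag are new compared to the question's code):

#include <cstddef>
#include <vector>

// Compile with -O3 -fopenmp-simd (GCC/Clang). The pragma licenses the
// compiler to pretend the += reduction is associative for this loop only.
float omp_sum(const std::vector<float>& vec) {
    float sum{0};
    #pragma omp simd reduction(+:sum)
    for (std::size_t i = 0; i < vec.size(); ++i) {
        sum += vec[i];
    }
    return sum;
}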
Clang happens to accept the pragma even without the reduction clause, but this isn't guaranteed to work or even to produce correct results. You do actually need to match the + in the reduction clause with the operation you do in the source: for integers at least, if you use reduction(*:sum) or reduction(&:sum), the compiler knows that the starting sum = 0 multiplied by or ANDed with anything equals 0, so it optimizes the function to just return 0, even though the loop body is sum += vec[i].
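As an illustration of that pitfall (my example, assuming integer elements):

#include <cstddef>
#include <vector>

// The reduction operator (*) disagrees with the loop body (+=). The original
// sum = 0 is combined into the final result with *, and 0 * anything == 0
// for integers, so the compiler may fold the whole function to "return 0".
int mismatched_sum(const std::vector<int>& vec) {
    int sum{0};
    #pragma omp simd reduction(*:sum)  // wrong: the body performs a sum
    for (std::size_t i = 0; i < vec.size(); ++i) {
        sum += vec[i];
    }
    return sum;
}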
Anyway, for floating-point, omp simd reduction tells the compiler that it's OK to pretend operations on that variable are associative, for just this loop. The rest of your program can still have strict FP semantics.