Tags: c++, optimization, x86-64, compiler-optimization, simd

Why is a simple FP loop not auto-vectorized, and slower than a SIMD intrinsics calculation?


Why does the compiler not use SIMD instructions for a simple loop computing a sum, even when compiling with -O3 and -march=native?

Consider the following two functions:

#include <immintrin.h>
#include <vector>

float sum_simd(const std::vector<float>& vec) {
    __m256 a = _mm256_setzero_ps();
    // Assumes vec.size() is a multiple of 8; no tail handling.
    for (std::size_t i = 0; i < vec.size(); i += 8) {
        __m256 tmp = _mm256_loadu_ps(&vec[i]);
        a = _mm256_add_ps(tmp, a);
    }
    float res{0.0f};
    for (std::size_t i = 0; i < 8; ++i) {
        res += a[i];  // GCC/Clang vector extension: per-lane access
    }
    return res;
}

float normal_sum(const std::vector<float>& vec) {
    float sum{0};
    for (size_t i = 0; i < vec.size(); ++i) {
        sum += vec[i];
    }
    return sum;
}

The compiler seems to turn the summations into:

vaddps  ymm0, ymm0, ymmword ptr [rax + 4*rdx]

and

vaddss  xmm0, xmm0, dword ptr [rcx + 4*rsi]
vaddss  xmm0, xmm0, dword ptr [rcx + 4*rsi + 4]
vaddss  xmm0, xmm0, dword ptr [rcx + 4*rsi + 8]
vaddss  xmm0, xmm0, dword ptr [rcx + 4*rsi + 12]
vaddss  xmm0, xmm0, dword ptr [rcx + 4*rsi + 16]
vaddss  xmm0, xmm0, dword ptr [rcx + 4*rsi + 20]
vaddss  xmm0, xmm0, dword ptr [rcx + 4*rsi + 24]
vaddss  xmm0, xmm0, dword ptr [rcx + 4*rsi + 28]

When I run this on my machine, the SIMD sum is substantially faster (roughly 10×). The same is true on Godbolt. See here for the code.

I compiled the program with GCC 13 and Clang 17 and used the options -O3 -march=native.

Why is the function normal_sum slower and not fully vectorized? Do I need to specify additional compiler options?


Solution

  • Why is the function normal_sum slower and not fully vectorized? Do I need to specify additional compiler options?

    Yes. -ffast-math solves this issue (see on Godbolt). Here is the main loop with this additional flag:

    .L10:
            vaddps  ymm1, ymm1, YMMWORD PTR [rax]     ;     <---------- vectorized
            add     rax, 32
            cmp     rcx, rax
            jne     .L10
    

    However, note that -ffast-math is a combination of several more specific flags, some of which can be quite dangerous. For example, -funsafe-math-optimizations and -ffinite-math-only can break existing code that uses infinities, or reduce its precision. In fact, some code, like the Kahan summation algorithm, requires the compiler not to assume floating-point operations are associative (which -ffast-math does). For more information, please read the post What does gcc's ffast-math actually do?.
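    For context, a Kahan summation looks like the sketch below. The compensation term c is algebraically zero, so a compiler that assumes FP associativity (as -ffast-math does) may simplify it away and silently turn this back into a plain sum:

    ```cpp
    #include <cstddef>
    #include <vector>

    // Kahan (compensated) summation: c tracks the rounding error lost
    // in each addition so it can be fed back into the next one.
    float kahan_sum(const std::vector<float>& vec) {
        float sum = 0.0f;
        float c = 0.0f;  // running compensation for lost low-order bits
        for (std::size_t i = 0; i < vec.size(); ++i) {
            float y = vec[i] - c;
            float t = sum + y;
            c = (t - sum) - y;  // algebraically zero, numerically the error
            sum = t;
        }
        return sum;
    }
    ```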

    The main reason the code is not automatically vectorized without -ffast-math is simply that floating-point addition is not associative (i.e. (a+b)+c != a+(b+c) in general). Because of that, the compiler cannot reorder the long chain of floating-point additions. Note that there is a flag specifically meant to change this behaviour (-fassociative-math), but it is often not sufficient on its own to auto-vectorize the code (as is the case here). One needs a combination of flags (a subset of -ffast-math) to enable auto-vectorization, depending on the target compiler (and possibly its version).

    Strict FP semantics are also why the horizontal-sum loop in your manually-vectorized version is so inefficient: it shuffles to extract each element one at a time, instead of shuffling to narrow the vector in half at each step. See Fastest way to do horizontal SSE vector sum (or other reduction).
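    As a sketch of that narrowing pattern (assuming AVX and SSE3 are available), a horizontal sum can halve the width at each step, using a logarithmic number of adds instead of eight scalar extractions:

    ```cpp
    #include <immintrin.h>

    // Horizontal sum of a __m256 by repeatedly halving the width:
    // 8 floats -> 4 -> 2 -> 1.
    float hsum256(__m256 v) {
        __m128 lo = _mm256_castps256_ps128(v);    // lower 4 lanes
        __m128 hi = _mm256_extractf128_ps(v, 1);  // upper 4 lanes
        __m128 s  = _mm_add_ps(lo, hi);           // 4 partial sums
        s = _mm_add_ps(s, _mm_movehl_ps(s, s));   // 2 partial sums
        s = _mm_add_ss(s, _mm_movehdup_ps(s));    // total in lane 0
        return _mm_cvtss_f32(s);
    }
    ```

    This is what you would use in place of the scalar `res += a[i]` loop in sum_simd.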


    A simple architecture-independent way to vectorize the code is to use OpenMP. To do that, you need to add the line #pragma omp simd reduction (+:sum) before the loop and the compilation flag -fopenmp-simd (or the full -fopenmp which would also let it try to auto-parallelize). See on Godbolt.

    Clang happens to accept the pragma even without the reduction clause, but this isn't guaranteed to work or even to produce correct results. You do actually need to match the + in the reduction clause with the operation in the loop body: for integers at least, if you write reduction (*:sum) or reduction (&:sum) while the body does sum += vec[i], the compiler knows that a starting sum = 0 multiplied or ANDed with anything stays 0, so it optimizes the function to simply return 0.

    Anyway, for floating-point, omp simd reduction tells the compiler that it is OK to pretend operations on that variable are associative, for just this loop. The rest of your program keeps strict FP semantics.
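    Putting the pieces together, the OpenMP version of normal_sum would look like this (compile with -O3 -fopenmp-simd; without that flag the pragma is simply ignored and the loop stays scalar):

    ```cpp
    #include <cstddef>
    #include <vector>

    // The reduction clause lets the compiler vectorize this one loop
    // as if FP addition were associative, without relaxing FP
    // semantics anywhere else in the program.
    float omp_sum(const std::vector<float>& vec) {
        float sum = 0.0f;
        #pragma omp simd reduction(+:sum)
        for (std::size_t i = 0; i < vec.size(); ++i) {
            sum += vec[i];
        }
        return sum;
    }
    ```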