Eigen::VectorXd::operator += seems ~69% slower than looping through a std::vector

The below code (needs google benchmark) fills up two vectors and adds them up, storing the result in the first one. For the vector types I've used Eigen::VectorXd and std::vector for performance comparison:

#include <Eigen/Core>
#include <benchmark/benchmark.h>
#include <vector>

auto constexpr N = 1024u;

template <typename TVector>
TVector generate(unsigned min) {
    TVector v(N);
    for (unsigned i = 0; i < N; ++i)
        v[i] = static_cast<double>(min + i);
    return v;
}

auto ev1 = generate<Eigen::VectorXd>(0);
auto ev2 = generate<Eigen::VectorXd>(N);
auto sv1 = generate<std::vector<double>>(0);
auto sv2 = generate<std::vector<double>>(N);

void add_vectors(Eigen::VectorXd& v1, Eigen::VectorXd const& v2) {
    v1 += v2;
}

void add_vectors(std::vector<double>& v1, std::vector<double> const& v2) {
    for (unsigned i = 0; i < N; ++i)
        v1[i] += v2[i];
}

static void eigen(benchmark::State& state) {
    for (auto _ : state) {
        add_vectors(ev1, ev2);
        benchmark::DoNotOptimize(ev1);
    }
}

static void standard(benchmark::State& state) {
    for (auto _ : state) {
        add_vectors(sv1, sv2);
        benchmark::DoNotOptimize(sv1);
    }
}

BENCHMARK(standard);
BENCHMARK(eigen);

I'm running it on Intel Xeon E-2286M @2.40Ghz, using Eigen 3.3.9, MSVC 16.11.2 with (among others) these relevant compiler swicthes /GL, /Gy, /O2, /D "NDEBUG", /Oi, and /arch:AVX. A tipical output looks like this:

Run on (16 X 2400 MHz CPU s)
CPU Caches:
  L1 Data 32K (x8)
  L1 Instruction 32K (x8)
  L2 Unified 262K (x8)
  L3 Unified 16777K (x1)
--------------------------------------------------
Benchmark           Time           CPU Iterations
--------------------------------------------------
standard           99 ns        100 ns    7466667
eigen             169 ns        169 ns    4072727

which seems to show that operating on std::vector is ~69% faster than on Eigen::VectorXd. In the disassembly, the tight loops look like these:

// For Eigen::VectorXd
00007FF672221A11  vmovupd     ymm0,ymmword ptr [rcx+rax*8]  
00007FF672221A16  vaddpd      ymm1,ymm0,ymmword ptr [r8+rax*8]  
00007FF672221A1C  vmovupd     ymmword ptr [r8+rax*8],ymm1  
00007FF672221A22  add         rax,4  
00007FF672221A26  cmp         rax,rdx  
00007FF672221A29  jge         eigen+0C7h (07FF672221A37h)  
00007FF672221A2B  mov         rcx,qword ptr [rsp+48h]  
00007FF672221A30  mov         r8,qword ptr [rsp+58h]  
00007FF672221A35  jmp         eigen+0A1h (07FF672221A11h)  

// For std::vector
00007FF672221B40  vmovups     ymm1,ymmword ptr [rax+rdx-20h]  
00007FF672221B46  vaddpd      ymm1,ymm1,ymmword ptr [rax+rcx-20h]  
00007FF672221B4C  vmovups     ymmword ptr [rax+rcx-20h],ymm1  
00007FF672221B52  vmovups     ymm1,ymmword ptr [rax+rdx]  
00007FF672221B57  vaddpd      ymm1,ymm1,ymmword ptr [rax+rcx]  
00007FF672221B5C  vmovups     ymmword ptr [rax+rcx],ymm1  
00007FF672221B61  lea         rax,[rax+40h]  
00007FF672221B65  sub         r8,1  
00007FF672221B69  jne         standard+0C0h (07FF672221B40h)

One can notice that both use vaddpd to add 4 doubles at time. However, for std::vector the compiler unrolled the loop to perform 2 vaddpd per iteration but it didn't do the same for Eigen::VectorXd. Another potentially important difference is that the loop for std::vector is aligned to 32 bytes (address ends in 0x40 = 64 = 2*32).

FWIW: I've added /Qvec-report:2 and the compiler reports:

[...]\Core\AssignEvaluator.h(415) : info C5002: loop not vectorized due to reason '1305'

and reason 1305 means "Not enough type information."

My educated guess is that Eigen's effort to use intrinsics (here _mm256_add_pd) is counterproductive and confuses the compiler. Just leaving the compiler do its business (auto-vectorisation) seems to be a better idea. Am I missing something or could this be considered an Eigen bug (missed optimisation opportunity)?

Solution

TL;DR: The problem mainly comes from the constant loop bound and not directly from Eigen. Indeed, in the first case, Eigen store the size of the vectors in vector attributes while in the second case, you explicitly use the constant N.

Clever compilers can use this information to unroll loops more aggressively because they know that N is quite big. Unrolling a loop with a small N is a bad idea since the code will be bigger and has to read by the processor. If the code is not already loaded in the L1 cache, it must be loaded from the other caches, the RAM or even the storage device in the worst case. The added latency is often bigger than executing a sequential loop with a small unroll factor. This is why compilers do not always unroll loops (at least not with a big unroll factor).

Inlining also plays an important role in this code. Indeed, if the functions are inlined, the compiler can propagate constants and know the size of the vector enabling it to further optimize the code by unrolling the loop more aggressively. However, if the functions are not inlined, then there is no way the compiler can know the loop bounds. Clever compilers can still generate conditional algorithm to optimize both small loops and big ones but this makes the program bigger and introduces a small overhead. Compilers like ICC and Clang do generate the different code alternatives when the code can be vectorized but the loop bounds are unknown or also when aliasing is not known at compile time (the number of generated variants can quickly be huge and so the code size).

Note that inlining functions may not be enough since the constant propagation can be trapped by a complex conditionals dealing with runtime-defined variables or non-inlined function calls. Alternatively, the quality of the constant propagation may not be sufficient for the target example.

Finally, aliasing also play a critical role in the ability of compilers to generate SIMD instructions (and possibly better unroll the loop) in this code. Indeed, aliasing often prevent the use of SIMD instructions and it is not always easy for compilers to check aliasing and generate fast implementations accordingly.

Testing the hypothesises

If the vector-based implementation use a loop bound stored in the vector objects, then the code generated by MSVC is not vectorized in the benchmark: the constant is not propagated correctly despite the inlining of the function. The resulting code should be much slower. Here is the generated code:

$LL24@standard:
        vmovsd  xmm0, QWORD PTR [r9+rcx*8]
        vaddsd  xmm1, xmm0, QWORD PTR [r8+rcx*8]
        vmovsd  QWORD PTR [r8+rcx*8], xmm1
        mov     rax, QWORD PTR std::vector<double,std::allocator<double> > sv1+8
        inc     edx
        sub     rax, QWORD PTR std::vector<double,std::allocator<double> > sv1
        sar     rax, 3
        mov     ecx, edx
        cmp     rcx, rax
        jb      SHORT $LL24@standard

If the Eigen-based implementation use a constant loop bound, then the generated code by MSVC is well vectorized and unrolled correctly in the benchmark: the compile-time constant helps the compiler to generate an loop unrolled 2 times. It does that by mixing SSE and AVX instructions which is very surprising (this point is discussed below). The resulting code should be significantly faster than the original Eigen implementation. However, it may not be as fast as the initial vector implementation due to the unexpected use of SSE instructions. Here is the generated code:

$LL24@eigen:
        vmovupd xmm1, XMMWORD PTR [rdx+rcx-16]
        vaddpd  xmm1, xmm1, XMMWORD PTR [rcx-16]
        vmovupd xmm2, XMMWORD PTR [rcx+rdx]
        vmovupd XMMWORD PTR [rcx-16], xmm1
        vaddpd  xmm1, xmm2, XMMWORD PTR [rcx]
        vmovupd XMMWORD PTR [rcx], xmm1
        vmovups ymm1, YMMWORD PTR [rdx+rcx+16]
        vaddpd  ymm1, ymm1, YMMWORD PTR [rcx+16]
        vmovups YMMWORD PTR [rcx+16], ymm1
        lea     rcx, QWORD PTR [rcx+64]
        sub     rax, 1
        jne     SHORT $LL24@eigen

Additional notes

It is worth noting that the generated code for the non-inlined version use a very inefficient scalar code (typically due to N being unknown and pointer aliasing expected to be possible).

Mixing SSE and AVX instructions in such a loop in your case is clearly a sub-optimal strategy and likely a compiler issue/bug. Indeed, the execution speed of the resulting code is certainly bounded by the store instructions on Intel processors like your. Your processor can execute 1 store instruction per cycle, 2 load instructions per cycle and can compute 2 vectorized addition per cycle. It can execute up to 6 micro-instructions per cycle (coming from 5 independent instructions and possibly 4 cached additional instructions). As a result, the generated code mixing SSE and AVX will at least take 3 cycles per iterations. Meanwhile, the original vector-based implementation could execute 4 loads, 2 stores, 2 additions and 3 other instructions like lea/sub/branch in only 2 cycles (possibly 3 in practice due to to complex hardware stuff like the actual micro-instruction port scheduling, the micro-instruction cache). However, note that the compiler argument do not specify to optimize the code for your specific processor architecture (ie. Intel Coffee Lake). Still, I highly doubt mixing SSE and AVX code would result in any significant boost in performance on an AMD processors too (or any mainstream x86 processors). Alternatively, I might be because the MSVC fails to fully detect that there is no aliasing in this case.

To get rid of the most aliasing problems preventing code vectorization and loop unrolling, OpenMP SIMD directives (eg. #pragma omp simd) can be used. MSVC support this experimentally using the flag /openmp:experimental. Here is resulting code:

void add_vectors(Eigen::VectorXd& v1, Eigen::VectorXd const& v2) {
    #pragma omp simd
    for (unsigned i = 0; i < N; ++i)
        v1[i] += v2[i];
}

MSVC surprisingly generates an assembly code with only SSE instructions, but if you enable AVX2, then it generate a relatively good code:

$LL26@eigen:
        mov     rcx, QWORD PTR Eigen::Matrix<double,-1,1,0,-1,1> ev1
        lea     rdx, QWORD PTR [rdx+128]
        mov     rax, QWORD PTR Eigen::Matrix<double,-1,1,0,-1,1> ev2
        vmovupd ymm0, YMMWORD PTR [rdx+rcx-192]
        vaddpd  ymm0, ymm0, YMMWORD PTR [rdx+rax-192]
        vmovupd YMMWORD PTR [rdx+rcx-192], ymm0
        mov     rcx, QWORD PTR Eigen::Matrix<double,-1,1,0,-1,1> ev1
        mov     rax, QWORD PTR Eigen::Matrix<double,-1,1,0,-1,1> ev2
        vmovupd ymm0, YMMWORD PTR [rdx+rcx-160]
        vaddpd  ymm0, ymm0, YMMWORD PTR [rdx+rax-160]
        vmovupd YMMWORD PTR [rdx+rcx-160], ymm0
        mov     rcx, QWORD PTR Eigen::Matrix<double,-1,1,0,-1,1> ev1
        mov     rax, QWORD PTR Eigen::Matrix<double,-1,1,0,-1,1> ev2
        vmovupd ymm0, YMMWORD PTR [rdx+rcx-128]
        vaddpd  ymm0, ymm0, YMMWORD PTR [rdx+rax-128]
        vmovupd YMMWORD PTR [rdx+rcx-128], ymm0
        mov     rcx, QWORD PTR Eigen::Matrix<double,-1,1,0,-1,1> ev1
        mov     rax, QWORD PTR Eigen::Matrix<double,-1,1,0,-1,1> ev2
        vmovupd ymm0, YMMWORD PTR [rdx+rcx-96]
        vaddpd  ymm0, ymm0, YMMWORD PTR [rdx+rax-96]
        vmovupd YMMWORD PTR [rdx+rcx-96], ymm0
        sub     r8, 1
        jne     $LL26@eigen

This code is still not perfect due to the unexpected useless mov instructions.

Alternatively, it may be possible to use fixed-size Eigen vectors for better performance.

Finally, note that other compilers (like Clang, ICC and GCC) behave very differently on this benchmark.