Tags: c++, performance, eigen

Why is setZero faster with an Eigen dense dynamic matrix than with a static matrix?


I compiled the following code for testing:

#include <iostream>
#include <Eigen/Dense>
#include <chrono>
#include <cstddef>

int main()
{
    constexpr size_t t = 10000;
    constexpr size_t size = 100;

    auto t1 = std::chrono::steady_clock::now();
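    // dynamic-size matrix: the first setZero(size, size) allocates, later calls only zero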
    Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic> m1;
    for (size_t i = 0; i < t; ++i)
    {
        m1.setZero(size, size);
    }
    auto t2 = std::chrono::steady_clock::now();
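    // fixed-size matrix: its storage is part of the object itself (on the stack here)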
    Eigen::Matrix<double, size, size> m2;
    for (size_t i = 0; i < t; ++i)
    {
        m2.setZero();
    }
    auto t3 = std::chrono::steady_clock::now();

    double t12 = static_cast<std::chrono::duration<double>>(t2 - t1).count();
    double t23 = static_cast<std::chrono::duration<double>>(t3 - t2).count();

    std::cout << "dynamic: " << t12 << " static: " << t23 << std::endl;
    return 0;
}

I found that the dynamic matrix is always faster than the static matrix.

Compiled with -O0

dynamic: 2.34759 static: 4.29692

Compiled with -O3

dynamic: 0.0170274 static: 0.0363988

Intuitively, shouldn't setZero on a dynamic matrix allocate new memory and therefore incur memory-allocation overhead?


Solution

  • TL;DR: the performance results depend strongly on the target platform and the compiler.

    First of all, this assertion is not always true (see the -march=native results below). That said, the dynamic version is indeed faster on my machine: I get dynamic: 0.128093 static: 0.142624 (using -std=c++20 -O3 -DNDEBUG and GCC 10.2.1). The reason is that GCC/Clang emit a call to memset in the dynamic version, while they generate many direct zeroing SIMD instructions in the static case (more specifically, 128-bit SSE instructions). The memset implementation can be faster than the code produced by mainstream compilers (e.g. GCC and Clang) in some cases.
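
    One way to see this is to compile two tiny functions and inspect the assembly (a sketch; the function names are mine). With g++ -O3 -S, the dynamic case should show a call to memset while the static case should show the direct SIMD stores described above:

    #include <Eigen/Dense>

    // Minimal functions for inspecting the generated code, e.g. with
    // `g++ -O3 -S` or on Compiler Explorer.
    void zero_dynamic(Eigen::MatrixXd& m)
    {
        m.setZero(100, 100); // typically compiled to a call to memset
    }

    void zero_static(Eigen::Matrix<double, 100, 100>& m)
    {
        m.setZero();         // typically direct SIMD store instructions
    }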

    Indeed, memset can use wider SIMD instructions if the libc is optimized for the target machine, while GCC may not use them. This can be the case on x86-64 platforms, where 256-bit AVX and AVX-512 instructions are available but are not automatically enabled by mainstream compilers, because the produced binaries must remain backward compatible with all x86-64 processors. When the most advanced SIMD instruction set available on the platform is enabled, results tend to be significantly better for the static version, though even then this is not always true. On my machine, I get dynamic: 0.105638 static: 0.0929698 with -march=native added.
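
    If you want to check which SIMD instruction set a given build actually enables, a quick sketch using the predefined GCC/Clang macros (assumption: an x86-64 target):

    #include <cstdio>

    int main()
    {
        // Predefined macros set by GCC/Clang depending on -march/-m flags.
    #if defined(__AVX512F__)
        std::puts("AVX-512 (512-bit) enabled");
    #elif defined(__AVX__)
        std::puts("AVX (256-bit) enabled");
    #elif defined(__SSE2__)
        std::puts("SSE2 (128-bit) enabled");
    #else
        std::puts("no SSE/AVX enabled");
    #endif
    }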

    A deeper analysis shows that my libc (GNU libc 2.31) does not use SIMD instructions directly, but the rep stos instruction. Modern processors can optimize this instruction and perform stores much wider than 1 byte. However, it also has a significant startup overhead. For more information about this instruction and its performance, see this great detailed post.
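
    To illustrate, here is a hypothetical sketch of what such a memset path boils down to, written with GCC extended inline assembly (x86-64 with GCC/Clang only; a real implementation would also handle an unaligned head/tail):

    #include <cstddef>
    #include <cstdint>

    // Zero a buffer with the x86 `rep stosq` string instruction, similar in
    // spirit to what some libc memset paths do internally.
    // Assumes `bytes` is a multiple of 8.
    void zero_rep_stos(void* dst, std::size_t bytes)
    {
        std::size_t qwords = bytes / 8;
        std::uint64_t zero = 0;
        asm volatile("rep stosq"
                     : "+D"(dst), "+c"(qwords) // RDI = destination, RCX = count
                     : "a"(zero)               // RAX = value to store
                     : "memory");
    }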

    Other parameters matter when comparing the two implementations: the size and kind of the CPU caches, the speed of the RAM, and the matrix size. Indeed, if the matrix does not fit in the CPU caches, then the memset implementation is generally faster because it can use non-temporal stores (NTS). This is especially true on processors with write-allocate caches (like Intel processors), as without NTS instructions such processors will read the matrix from RAM before writing zeros into it (this process is up to 2 times slower than writing directly to memory with NTS instructions).
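
    For reference, a minimal sketch of zeroing a buffer with non-temporal stores via AVX intrinsics (the function name is mine; it assumes AVX, a 32-byte-aligned pointer, and a length that is a multiple of 4 doubles):

    #include <immintrin.h>
    #include <cstddef>

    // Zero `n` doubles with non-temporal (streaming) stores, bypassing the
    // cache and thus avoiding the read-for-ownership traffic described above.
    void zero_nontemporal(double* data, std::size_t n)
    {
        const __m256d zero = _mm256_setzero_pd();
        for (std::size_t i = 0; i < n; i += 4)
            _mm256_stream_pd(data + i, zero); // 32-byte NT store, no cache fill
        _mm_sfence(); // order the NT stores before subsequent memory operations
    }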

    The thing is that GCC/Clang assume the static matrix is small enough to fit in the CPU caches (and likely to already be there, since the matrix should be stored on the stack in practice), so a heavily unrolled loop of classic store instructions is likely faster than calling memset, which ships multiple versions and selects a suitable one based on the data size and the target platform. Fortunately, this assumption often holds, hence the performance results you get. Note that it can hardly be made (without profile-guided optimizations) when the size is dynamic, as the size is unknown at compile time.

    Note that other parameters can matter, like the order of execution (because of frequency scaling) and memory alignment (although they do not seem to be significant on my machine).
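
    As a quick way to rule out the alignment factor, one can print the alignment of both matrices' data pointers (a sketch; the modulus of 64 is an arbitrary cache-line-sized probe):

    #include <Eigen/Dense>
    #include <cstdint>
    #include <iostream>

    int main()
    {
        Eigen::MatrixXd m1(100, 100);       // heap-allocated by Eigen
        Eigen::Matrix<double, 100, 100> m2; // lives on the stack here
        std::cout << "dynamic data % 64: "
                  << reinterpret_cast<std::uintptr_t>(m1.data()) % 64 << '\n';
        std::cout << "static data % 64:  "
                  << reinterpret_cast<std::uintptr_t>(m2.data()) % 64 << '\n';
    }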