Tags: c++, assembly, eigen, matrix-multiplication, simd

Why is 4x4 Matrix Multiplication in Eigen More Than Twice as Fast as 3x3?


I compared the performance of 3x3 and 4x4 matrix multiplication using Eigen with the -O3 optimization flag, and I found that the 4x4 case is more than twice as fast as the 3x3 case! This result was unexpected, and I'm curious to understand why.

The benchmark code I used was compiled with the following command: g++ -O3 -I/usr/include/eigen3 tmp.cpp. Here's the code snippet for the benchmark:

#include <Eigen/Dense>
#include <Eigen/Geometry>
#include <iostream>
#include <vector>
#include <chrono>
#include <random>
#include <valgrind/callgrind.h>

void benchmark_matrix3d_multiplication(const std::vector<Eigen::Matrix3d>& matrices) {
    auto start = std::chrono::high_resolution_clock::now();
    Eigen::Matrix3d result = Eigen::Matrix3d::Identity();
    for(size_t i = 0; i < matrices.size(); i++) {
        asm("# 3dmat mult start");
        result = result * matrices[i];
        asm("# 3dmat mult end");
    }
    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
    std::cout << "Matrix3d: " << duration.count() << " microseconds" << std::endl;
    std::cout << result << std::endl;  // to avoid compiler optimization
}

void benchmark_matrix4d_multiplication(const std::vector<Eigen::Matrix4d>& matrices4) {
    auto start = std::chrono::high_resolution_clock::now();
    Eigen::Matrix4d result = Eigen::Matrix4d::Identity();
    for(size_t i = 0; i < matrices4.size(); i++) {
        asm("# 4dmat mult start");
        result = result * matrices4[i];
        asm("# 4dmat mult end");
    }
    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
    std::cout << "Matrix4d: " << duration.count() << " microseconds" << std::endl;
    std::cout << result << std::endl;  // to avoid compiler optimization
}

void benchmark_matrix_multiplication() {
    const int num_matrices = 100000;

    std::vector<Eigen::Matrix3d> matrices(num_matrices);
    for(size_t i = 0; i < num_matrices; i++) {
        auto q = Eigen::Quaterniond::UnitRandom();
        matrices[i] = q.toRotationMatrix();
    }
    std::vector<Eigen::Matrix4d> matrices4(num_matrices);
    for(size_t i = 0; i < num_matrices; i++) {
        matrices4[i].block<3,3>(0,0) = matrices[i];
    }

    CALLGRIND_START_INSTRUMENTATION; // start callgrind
    benchmark_matrix3d_multiplication(matrices);
    benchmark_matrix4d_multiplication(matrices4);
}

int main() {
    benchmark_matrix_multiplication();
}

The result is the following, and I confirmed that the two results match each other.

Matrix3d: 2008 microseconds
   0.440386   -0.897765 -0.00889435
   0.808307    0.400777   -0.431298
   0.390768    0.182748    0.902166
Matrix4d: 833 microseconds
   0.440386   -0.897765 -0.00889435           0
   0.808307    0.400777   -0.431298           0
   0.390768    0.182748    0.902166           0
          0           0           0           0

Note that I added asm("# 3dmat mult start") and similar markers to identify the relevant sections of the assembly code. Based on the assembly output, it appears that the multiplication operations are indeed being performed and not optimized away by the compiler. Interestingly, the assembly code for the 4x4 case has more lines of instructions than the 3x3 case, yet it executes faster.

Could someone explain why the 4x4 matrix multiplication is significantly faster in this scenario? Is there a mistake in my measurement, or is there an underlying reason for this result?

Additional information


Solution

  • TL;DR: there are several issues happening simultaneously, but the main one is that the 3D version has a higher latency (due to the scheduling and register file issues) which cannot be overlapped with computation due to the loop-carried dependency. Changing the order of the operations helps the CPU to hide the latency of the 3D version so it can be faster than the 4D one. Other issues impacting the stability of the benchmark are frequency scaling (and more generally power management) and memory alignment.


    Setting up a good benchmark

    I can reproduce this on an AMD Ryzen 7 5700U CPU (Zen2) with GCC 13.2.0 on Ubuntu 24.04. The performance results are very unstable for one run (since the benchmark is too short). That being said, when running the program many times (e.g., 1000 times), interesting things start to happen and the weird effect surprisingly appears!

    First of all, here are the results on my machine for 100 runs without any modification of the program:

    [plot: performance results for 100 runs]

    Note that the vertical axis is the reported time (in µs) and the horizontal axis is the number of iterations.

    We can see that the 3D version is less stable than the 4D version, but the noise is so big that we cannot conclude anything (maybe just that the 3D version is less stable than the 4D one and there is a kind of strange threshold in the timings of the 3D version).

    If we increase the number of runs to 1000, then we get surprising results:

    [plot: performance results for 1000 runs]

    OK, now the results are far more stable (less random), more deterministic (similar results across many groups of runs), and also more reliable (because the benchmark runs for dozens of seconds).

    We can now say that the 3D version is indeed slower on average and less stable than the 4D version. Interestingly, the 4D version suddenly becomes faster after a certain amount of time, unlike the 3D version.


    Unrelated parameters

    Note that using -march=native makes computations faster (thanks to AVX/AVX2) but it has no impact on the issues overall. Also, please note that -march=native and -mavx2 -mfma give the same performance results.

    Note that a similar effect can be seen with vectors 20 times larger. The effect is also independent of the thread binding, so this is certainly not due to context switches or NUMA effects. Changing the order of the function calls did not result in any significant performance impact. The scaling governor is set to performance on all cores (which may still run at a variable frequency though). Could it be due to frequency scaling?


    Impact of the frequency (scaling) and power management

    Well, it seems frequency scaling somehow plays a role in the initial threshold of the 4D version. I set the minimum and maximum frequency to the same value, 800 MHz (because of a bug with the scaling governor on my machine, the minimum frequency I set is ignored, so active cores tend to always run at 800 MHz). I checked that the frequency was stable during the run (at least at a coarse granularity). Here is the result:

    [plot: performance results for 1000 runs at 800 MHz]

    We can see that there is no threshold anymore for the 4D version! More generally, I observed the following effects when trying different frequencies:

    • a lower frequency reduces the gap between high timings (the first ones) and low timings (the later ones after the threshold) of the 4D version;
    • a lower frequency makes the threshold delay (i.e., number of iterations before the threshold) smaller;
    • more items in the vectors also results in a smaller threshold delay.

    Regarding the last two points, this is because the threshold delay appears to be fixed in wall-clock time (about 2.5 seconds), independent of the frequency: a lower frequency makes each iteration slower, so the threshold is reached after fewer iterations. All of this shows that the frequency impacts the threshold, though I cannot explain why. My guess is that this is related to the power management of AMD Zen CPUs. I expect this constant to be either set in the kernel (possibly in vendor-dependent files) or a hardware/firmware constant based on the off-core frequency (independent of the core frequency).

    I also found out (later) that the effect does not happen when the laptop is charging, though the performance is then a bit less stable! This confirms that the issue is directly related to the power management of the CPU.
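
    As an aside, the frequency stability mentioned above can be spot-checked from the benchmark process itself. Here is a minimal sketch, assuming a Linux system exposing the cpufreq sysfs interface (the check above may well have been done with an external tool instead):

    #include <fstream>
    #include <string>

    // Sketch (assumption: Linux with /sys/devices/system/cpu/*/cpufreq).
    // Returns the current frequency of the given core in kHz, or -1 if the
    // sysfs entry cannot be read.
    long read_core_freq_khz(int cpu) {
        std::ifstream f("/sys/devices/system/cpu/cpu" + std::to_string(cpu) +
                        "/cpufreq/scaling_cur_freq");
        long khz = -1;
        f >> khz;
        return khz;
    }

    Calling it before and after each timed loop (or periodically from another thread) is enough to detect coarse frequency changes during the run.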


    Impact of the memory alignment

    When the execution time of a function doing basic numerical computation is quantized (and the effect is not due to frequency), the quantization is often caused by memory-alignment issues. This is what happens for the 3D version, typically because a double-precision 3x3 matrix takes 3*3*8 = 72 bytes (as pointed out by Raildex in the comments). In practice, compilers like GCC can add some padding at the end of the structure so that memory accesses are typically faster (resulting in 80-byte 3x3 matrices). Note that this is not a power of two. Moreover, a 3x3 matrix typically crosses two cache lines, causing additional loads from the L1 cache during unaligned accesses. This additional load results in a higher latency. Even worse, the latency overhead is not constant, so I think this might make it harder for the CPU instruction scheduler to schedule the target instructions efficiently.

    To check the memory-alignment hypothesis, we can simply align each matrix on a cache line. This is not really efficient in terms of memory space (so it can decrease the performance of memory-bound codes), but it should at least fix this alignment issue. Here is the code to force matrices to be aligned in memory:

    // Align each matrix to 64 bytes (i.e., a cache line)
    class alignas(64) Matrix3d : public Eigen::Matrix3d
    {
    public:
        Matrix3d() = default;
        Matrix3d(const Matrix3d&) = default;
        Matrix3d(Matrix3d&&) = default;
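        // The templated constructor and assignment below allow arbitrary Eigen
        // expressions (products, blocks, q.toRotationMatrix(), ...) to be
        // converted to and assigned into the aligned wrapper.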
        template<typename MatrixType>
        Matrix3d(const MatrixType& other) : Eigen::Matrix3d(other) {}
        template<typename MatrixType>
        Matrix3d& operator=(const MatrixType& other) { Eigen::Matrix3d::operator=(other); return *this; }
    };
    
    class alignas(64) Matrix4d : public Eigen::Matrix4d
    {
    public:
        Matrix4d() = default;
        Matrix4d(const Matrix4d&) = default;
        Matrix4d(Matrix4d&&) = default;
        template<typename MatrixType>
        Matrix4d(const MatrixType& other) : Eigen::Matrix4d(other) {}
        template<typename MatrixType>
        Matrix4d& operator=(const MatrixType& other) { Eigen::Matrix4d::operator=(other); return *this; }
    };
    
    // Replace all occurrences of Eigen::Matrix by Matrix in the rest of the code
    // [...]
    

    Here are the performance results:

    [plot: performance results for 1000 runs with matrices aligned to a cache line]

    We can see that the timings of the 3D version are not quantized anymore. Unfortunately, the timing matches the upper threshold of the previous benchmarks. One can also note that the 4D version is a bit slower (possibly due to less efficient generated code). Strangely, the timings of the 4D version are also less stable; I expect the alignment of the array allocated by std::vector to impact these timings.
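
    As a quick sanity check (a sketch based on the Matrix3d wrapper above; it assumes C++17, where std::vector honors over-aligned element types via aligned operator new), we can assert the alignment and padded size of the wrapper and verify that every element of the vector starts on a cache-line boundary:

    #include <cstdint>
    #include <iostream>
    #include <vector>

    // 72 bytes of payload rounded up to the 64-byte alignment gives 128 bytes.
    static_assert(alignof(Matrix3d) == 64, "wrapper should be cache-line aligned");
    static_assert(sizeof(Matrix3d) == 128, "payload padded to two cache lines");

    void check_alignment(const std::vector<Matrix3d>& v) {
        for (const auto& m : v)
            if (reinterpret_cast<std::uintptr_t>(&m) % 64 != 0) {
                std::cout << "unaligned element at " << &m << std::endl;
                return;
            }
        std::cout << "all " << v.size() << " matrices are 64-byte aligned" << std::endl;
    }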

    Once combined with the fixed (low) frequency, we get the following performance results:

    [plot: performance results for 1000 runs with matrices aligned to a cache line at 800 MHz]

    Results are now stable, but there is still one big issue: why is the 4D version still faster than the 3D one?


    Instruction scheduling, latency and dependencies

    Your functions have a special property: each iteration of the hot loop depends on the previous one. As a result, it is much more difficult for the CPU to hide the latency of the instructions, and it is also difficult to find enough independent instructions to execute in parallel. Because of that, the performance of the code should be very sensitive to the latency of the instructions and to any overhead impacting it. I think this is why there are so many strange effects in this simple code and why the results are so unstable.

    The scheduling of the instructions can have a strong impact on performance. This is especially true on Zen2 CPUs, which AFAIK use multiple separate FP schedulers (rather than a unified scheduler). Moreover, some instructions can only be scheduled on specific ports. Thus, if the scheduler makes a bad decision by not queuing instructions to the best queue for a given iteration, the latency can increase slightly (probably by a few cycles), so the overall execution will be slower. A few cycles seems small, but a few cycles per iteration is a lot when each iteration only lasts dozens of cycles. Here, in the last 800 MHz benchmark, the 4D version takes about 4800 µs to execute 100 000 iterations, which is 4800 µs × 800 MHz / 100 000 ≈ 38 cycles/iteration (while there are about 140-150 instructions/iteration to execute)!

    In practice, hardware performance counters tend to indicate poor back-end resource usage (on a modified example meant to measure each version separately):

    Use of the 4 FP SIMD ports (i.e., execution units):
      - 3D version:
           467 703 544      fpu_pipe_assignment.total0                                            
           940 583 576      fpu_pipe_assignment.total1                                            
            22 580 911      fpu_pipe_assignment.total2                                            
            10 002 224      fpu_pipe_assignment.total3  
      - 4D version:
           334 752 204      fpu_pipe_assignment.total0                                            
           280 487 188      fpu_pipe_assignment.total1                                            
           269 800 454      fpu_pipe_assignment.total2                                            
           268 277 318      fpu_pipe_assignment.total3     
    

    In the 3D version, two of the ports are barely used at all, and of the two remaining ports, one is twice as saturated as the other! This is really bad. Meanwhile, the 4D version balances the work well across the four ports. This tends to confirm the instruction-scheduling issues.

    The hardware performance counters also indicate the reason for dispatched instruction stalls. Here are the two main sources of stalls in the two versions:

    Source of stalls (in cycles):
      - 3D version:
         7 936 000 400      cycles                                                                
         1 609 456 587      de_dis_dispatch_token_stalls1.fp_sch_rsrc_stall                                      
         3 998 358 801      de_dis_dispatch_token_stalls1.fp_reg_file_rsrc_stall                                       
      - 4D version:
         6 950 481 062      cycles
           194 523 072      de_dis_dispatch_token_stalls1.fp_sch_rsrc_stall                                      
         2 858 781 759      de_dis_dispatch_token_stalls1.fp_reg_file_rsrc_stall                                      
    

    fp_sch_rsrc_stall counts "cycles where a dispatch group is valid but does not get dispatched due to a token stall. FP scheduler resource stall.", while fp_reg_file_rsrc_stall is the analogous counter for the "floating point register file resource stall.". Thus, in the 3D version, while some stalls (20%) happen due to scheduling issues, the main source of stalls (50%) seems to be the FP register file. Generally, this is due to missing entries in the FP register file, but that would be weird here since Zen2 is supposed to have 160 FP registers (which seems sufficient for this code). I wonder if it could somehow be due to the speculative execution of FP instructions. In the 4D version, there are also a lot of FP register-file stalls (41%), but nearly no stalls due to scheduling. While I do not fully understand why there are so many FP register-file stalls, the latter point confirms the above hypothesis.


    Faster/Better code

    In practice, you should really avoid writing such code, for several reasons. First of all, it is certainly not numerically stable (because of the large number of matrix multiplications performed iteratively).

    Moreover, the dependency between iterations typically decreases performance. A key to speeding up this computation is to use a pair-wise reduction (i.e., something like (A*B)*(C*D)). This helps the CPU compute multiple matrix multiplications concurrently (possibly even in parallel), reducing scheduling stalls. Besides, a pair-wise reduction is much more numerically stable. This improvement makes the 3D version faster than the 4D version! Here are the performance results when matrices are multiplied pair by pair (for both versions):

    [plot: faster 3D version thanks to more concurrency]

    We can see that the 3D version is now faster and the 4D version is not affected much by this improvement. Note that computing matrices by groups of 4 decreases performance since the generated code seems to be less efficient for my CPU (due to instruction scheduling and too many required registers).
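
    For illustration, here is a minimal sketch of such a pair-wise loop for the 3D case (one possible implementation, not necessarily the exact code behind the plot above). The product of each pair does not depend on the accumulator, so the CPU can overlap it with the previous iteration:

    #include <Eigen/Dense>
    #include <vector>

    // Sketch: pair-wise product reduction. Thanks to associativity, the result
    // is the same as the original left-to-right loop, but the dependency chain
    // on `acc` is roughly halved and the pair products are independent of it.
    Eigen::Matrix3d reduce_pairwise(const std::vector<Eigen::Matrix3d>& ms) {
        Eigen::Matrix3d acc = Eigen::Matrix3d::Identity();
        std::size_t i = 0;
        for (; i + 1 < ms.size(); i += 2) {
            Eigen::Matrix3d pair = ms[i] * ms[i + 1];  // independent of acc
            acc = acc * pair;
        }
        if (i < ms.size())  // odd number of matrices
            acc = acc * ms[i];
        return acc;
    }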

    FMA instructions (i.e., instructions fusing a multiplication with an addition) can increase the accuracy of the computation while reducing the latency of the code. You can specify -mfma to the compiler so as to generate FMA instructions. However, as stated previously, it does not impact the difference between the 3D and 4D version.

    Adding padding to the 3D matrix can help to perform possibly fewer (aligned) loads/stores, avoiding some latency overhead (AFAIK, storing data to a cache line can impact the speed of reads from the same cache line at different locations on some CPUs). Compilers tend not to do that because they need to prove the access is safe and that the operations do not cause any side effects on the non-masked values (not to mention that masking can be expensive on some platforms).
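
    One simple way to get such padded storage without writing masked code by hand (a sketch, not something from the original answer) is to embed each 3x3 matrix into a 4x4 identity, so that the product of the padded matrices carries the 3x3 product in its top-left block:

    #include <Eigen/Dense>

    // Sketch: pad a 3x3 matrix into a 4x4 one. Because the extra row/column
    // comes from the identity, multiplying the padded matrices yields the 3x3
    // product in the top-left block (unlike the benchmark above, where the
    // extra entries of each Matrix4d were never explicitly set).
    Eigen::Matrix4d pad_to_4x4(const Eigen::Matrix3d& m) {
        Eigen::Matrix4d p = Eigen::Matrix4d::Identity();
        p.block<3, 3>(0, 0) = m;
        return p;
    }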

    For basic 3x3 and 4x4 matrices, the best solution might be to manually write assembly code for the target architecture despite the many downsides (not simple, tedious to write, not really maintainable, not portable).