In a Python program, I need at some point to multiply two large lists of 2x2 matrices, pairwise. In the code, both lists are numpy arrays with shape (n, 2, 2). The expected result is another (n, 2, 2) array, where matrix i is the product of matrix i of the first list and matrix i of the second list, and so on.
After some profiling, I found that matrix multiplication was a performance bottleneck. Out of curiosity, I tried writing the matrix multiplication "explicitly". Below is a code example with measured runtimes.
import timeit
import numpy as np
def explicit_2x2_matrices_multiplication(
    mats_a: np.ndarray, mats_b: np.ndarray
) -> np.ndarray:
    matrices_multiplied = np.empty_like(mats_b)
    for i in range(2):
        for j in range(2):
            matrices_multiplied[:, i, j] = (
                mats_a[:, i, 0] * mats_b[:, 0, j] + mats_a[:, i, 1] * mats_b[:, 1, j]
            )
    return matrices_multiplied
matrices_a = np.random.random((1000, 2, 2))
matrices_b = np.random.random((1000, 2, 2))
assert np.allclose(  # Checking that the explicit version is correct
    matrices_a @ matrices_b,
    explicit_2x2_matrices_multiplication(matrices_a, matrices_b),
)
print(  # 1.1814142999992328 seconds
    timeit.timeit(lambda: matrices_a @ matrices_b, number=10000)
)
print(  # 1.1954495010013488 seconds
    timeit.timeit(lambda: np.matmul(matrices_a, matrices_b), number=10000)
)
print(  # 2.2304022700009227 seconds
    timeit.timeit(lambda: np.einsum('lij,ljk->lik', matrices_a, matrices_b), number=10000)
)
print(  # 0.19581600800120214 seconds
    timeit.timeit(
        lambda: explicit_2x2_matrices_multiplication(matrices_a, matrices_b),
        number=10000,
    )
)
As tested in the code, this function produces the same results as the regular matrix __matmul__ result. However, what is not the same is the speed: on my machine, the explicit expression is up to 10 times faster.
This is quite a surprising result to me. I would have expected the numpy expression to be faster, or at least comparable to the longer Python version, not an order of magnitude slower as we see here. I would be curious to hear insights as to why the difference in performance is so drastic.
I am running numpy version 1.25, and Python version 3.10.6.
TL;DR: all the methods provided in the question are very inefficient. In fact, Numpy is clearly sub-optimal and there is no way to compute this efficiently only with Numpy. Still, there are faster solutions than the ones provided in the question.
The Numpy code makes use of powerful generic iterators to apply a given computation (like a matrix multiplication) to multiple array slices. Such iterators are useful to implement broadcasting and also to generate a relatively simple implementation of einsum. The thing is, they are also quite expensive when the number of iterations is huge and the arrays are small. This is exactly what happens in your use-case. While the overhead of Numpy's iterators can be mitigated by optimizing the Numpy code, there is no way to reduce this overhead to a negligible time in this specific use-case. Indeed, there are only 12 floating-point operations to perform per matrix (8 multiplications and 4 additions). A relatively recent x86-64 processor can compute each matrix in less than 10 nanoseconds using scalar FMA units. In fact, one can use SIMD instructions to compute each matrix in only a few nanoseconds.
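One quick way to see this fixed per-call overhead (my addition, with arbitrarily chosen sizes) is to time the batched matmul for growing n and observe that the runtime barely grows for small n:

import timeit
import numpy as np

# Illustrative micro-benchmark: at small n the time is dominated by a fixed
# per-call cost rather than by the 12 flops per matrix.
for n in (1, 10, 100, 1000):
    a = np.random.random((n, 2, 2))
    b = np.random.random((n, 2, 2))
    t = timeit.timeit(lambda: a @ b, number=10000)
    print(f"n={n:5d}: {t:.4f} s")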
First things first, we can mostly remove the overhead of the internal Numpy iterator by doing the matrix multiplication ourselves, operating on vectors along the first axis. This is exactly what explicit_2x2_matrices_multiplication does!
While explicit_2x2_matrices_multiplication should be significantly faster, it is still sub-optimal: it performs non-contiguous memory reads, creates several useless temporary arrays, and each Numpy call introduces a small overhead. A faster solution is to write Numba/Cython code so the underlying compiler can generate a very good sequence of instructions optimized for 2x2 matrices. Good compilers can even generate SIMD instructions in this case, which is something impossible for Numpy. Here is the resulting code:
import numpy as np
import numba as nb

@nb.njit('(float64[:,:,::1], float64[:,:,::1])')
def compute_fastest(matrices_a, matrices_b):
    assert matrices_a.shape == matrices_b.shape
    assert matrices_a.shape[1] == 2 and matrices_a.shape[2] == 2
    n = matrices_a.shape[0]
    result_matrices = np.empty((n, 2, 2))
    for k in range(n):
        for i in range(2):
            for j in range(2):
                result_matrices[k,i,j] = matrices_a[k,i,0] * matrices_b[k,0,j] + matrices_a[k,i,1] * matrices_b[k,1,j]
    return result_matrices
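For reference, a minimal usage sketch (my addition; it assumes C-contiguous float64 inputs, which is what the explicit signature above requires, and np.random.random produces by default):

import timeit
import numpy as np

matrices_a = np.random.random((1000, 2, 2))
matrices_b = np.random.random((1000, 2, 2))

# Correctness check against the built-in batched matmul
assert np.allclose(matrices_a @ matrices_b, compute_fastest(matrices_a, matrices_b))

# Timing; compilation already happened at decoration time thanks to the explicit signature
print(timeit.timeit(lambda: compute_fastest(matrices_a, matrices_b), number=10000))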
Here are performance results on my machine with an i5-9600KF CPU for 1000x2x2 matrices:
Naive einsum: 214 µs
matrices_a @ matrices_b: 102 µs
explicit_2x2_matrices_multiplication: 24 µs
compute_fastest: 2.7 µs <-----
The Numba implementation reaches 4.5 GFlops. Each matrix is computed in only 12 CPU cycles (2.7 ns)! My machine is able to reach up to 300 GFlops in practice (theoretically 432 GFlops), but only 50 GFlops with 1 core and 12.5 GFlops using scalar code (theoretically 18 GFlops). The granularity of the operation is too small for multiple threads to be useful (the overhead of spawning threads is at least a few microseconds). Besides, a SIMD code can hardly saturate the FMA units because the input data layout requires SIMD shuffles, so 50 GFlops is actually an optimistic upper bound. As a result, we can safely say that the Numba implementation is pretty efficient.

Still, a faster code could be written thanks to SIMD instructions (I expect a speed-up of about x2 in practice). That being said, writing native code using SIMD intrinsics, or helping the compiler to generate fast SIMD code, is really far from being easy (not to mention the final code would be ugly and hard to maintain). Thus, a SIMD implementation might not be worth the effort.
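If you want to check what the compiler actually generated for the Numba function, one option (my addition, not part of the original answer) is to dump the assembly of the compiled signature and look for packed floating-point instructions such as vfmadd...pd or vmulpd:

# Inspect the assembly Numba/LLVM generated for each compiled signature and count
# packed (SIMD) floating-point instructions as a rough indicator of vectorization.
for sig, asm in compute_fastest.inspect_asm().items():
    packed = [line for line in asm.splitlines()
              if 'vfmadd' in line or 'vmulpd' in line or 'vaddpd' in line]
    print(sig, '->', len(packed), 'packed FP instructions')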