python numpy performance cython matrix-multiplication

Speed-up numpy matrix multiplication using cython

I am computing a matrix multiplication at few thousand times during my algorithm. Therefore, I compute:

import numpy as np
import time


def mat_mul(mat1, mat2, mat3, mat4):
    return(np.dot(np.transpose(mat1),np.multiply(np.diag(mat2)[:,None], mat3))+mat4)

n = 2000
mat1 = np.random.rand(n,n)
mat2 = np.diag(np.random.rand(n))
mat3 = np.random.rand(n,n)
mat4 = np.random.rand(n,n)

t0=time.time()
cov_11=mat_mul(mat1, mat2, mat1, mat4)
t1=time.time()
print('time ',t1-t0, 's')

The matrices are of size: n = (2000,2000) and mat2 only has entries along its diagonal. The remaining entries are zero.

On my machine I get the following: time 0.3473696708679199 s

How can I speed this up?

Thanks.

Solution

The Numpy implementation can be optimized a bit by reducing the amount of temporary arrays and reuse them as much as possible (ie. multiple times). Indeed, while matrix multiplications are generally heavily-optimized by BLAS implementations, filling/copying (newly allocated) arrays add a non-negligible overhead.

Here is the implementation:

def mat_mul_opt(mat1, mat2, mat3, mat4):
    tmp1 = np.empty((n,n))
    tmp2 = np.empty((n,n))
    vect = np.diag(mat2)[:,None]
    np.dot(np.transpose(mat1),np.multiply(vect, mat3, out=tmp1), out=tmp2)
    np.add(mat4, tmp2, out=tmp1)
    return tmp1

The code can be optimized further if it is fine to mutate input matrices or if you can pre-allocate tmp1 and tmp2 outside the function once (and then reuse them multiple times). Here is an example:

def mat_mul_opt2(mat1, mat2, mat3, mat4, tmp1, tmp2):
    vect = np.diag(mat2)[:,None]
    np.dot(np.transpose(mat1),np.multiply(vect, mat3, out=tmp1), out=tmp2)
    np.add(mat4, tmp2, out=tmp1)
    return tmp1

Here are performance results on my i5-9600KF processor (6-cores):

mat_mul:                 103.6 ms
mat_mul_opt1:             96.7 ms
mat_mul_opt2:             83.5 ms
np.dot time only:         74.4 ms   (kind of practical lower-bound)
Optimal lower bound:      55   ms   (quite optimistic)