I am computing a matrix multiplication a few thousand times during my algorithm. Therefore, I compute:
import numpy as np
import time

def mat_mul(mat1, mat2, mat3, mat4):
    return np.dot(np.transpose(mat1), np.multiply(np.diag(mat2)[:, None], mat3)) + mat4

n = 2000
mat1 = np.random.rand(n, n)
mat2 = np.diag(np.random.rand(n))
mat3 = np.random.rand(n, n)
mat4 = np.random.rand(n, n)

t0 = time.time()
cov_11 = mat_mul(mat1, mat2, mat1, mat4)
t1 = time.time()
print('time ', t1 - t0, 's')
The matrices are of size (2000, 2000), and mat2 only has entries along its diagonal; the remaining entries are zero.
On my machine I get the following:
time 0.3473696708679199 s
How can I speed this up?
Thanks.
The NumPy implementation can be optimized a bit by reducing the number of temporary arrays and reusing them as much as possible (i.e. multiple times). Indeed, while matrix multiplications are generally heavily optimized by BLAS implementations, filling/copying (newly allocated) arrays adds a non-negligible overhead.
Here is the implementation:
def mat_mul_opt(mat1, mat2, mat3, mat4):
    # Pre-allocate scratch buffers so the intermediate results do not trigger new allocations
    tmp1 = np.empty((n, n))
    tmp2 = np.empty((n, n))
    # Extract the diagonal of mat2 as a column vector for broadcasting
    vect = np.diag(mat2)[:, None]
    np.dot(np.transpose(mat1), np.multiply(vect, mat3, out=tmp1), out=tmp2)
    np.add(mat4, tmp2, out=tmp1)
    return tmp1
The code can be optimized further if it is fine to mutate the input matrices, or if you can pre-allocate tmp1 and tmp2 outside the function once (and then reuse them multiple times). Here is an example:
def mat_mul_opt2(mat1, mat2, mat3, mat4, tmp1, tmp2):
    vect = np.diag(mat2)[:, None]
    np.dot(np.transpose(mat1), np.multiply(vect, mat3, out=tmp1), out=tmp2)
    np.add(mat4, tmp2, out=tmp1)
    return tmp1
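For instance, the two buffers can be allocated once outside the hot loop and then reused on every call. The snippet below is only a minimal usage sketch reusing the variables from the question; the iteration count is illustrative, not part of the original code:

# Hypothetical usage: allocate the scratch buffers once, then reuse them
# across the few thousand calls of the algorithm (shapes assumed to be (n, n)).
tmp1 = np.empty((n, n))
tmp2 = np.empty((n, n))

for _ in range(1000):  # illustrative iteration count
    cov_11 = mat_mul_opt2(mat1, mat2, mat1, mat4, tmp1, tmp2)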
Here are performance results on my i5-9600KF processor (6 cores):
mat_mul: 103.6 ms
mat_mul_opt: 96.7 ms
mat_mul_opt2: 83.5 ms
np.dot time only: 74.4 ms (kind of practical lower-bound)
Optimal lower bound: 55 ms (quite optimistic)
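For reference, the "np.dot time only" bound can be estimated roughly as follows; this is only a sketch of such a measurement (the quoted numbers come from my machine, and results will vary with the CPU and the BLAS backend):

# Rough timing of the bare matrix product alone (illustrative measurement only).
t0 = time.time()
np.dot(np.transpose(mat1), mat3)
t1 = time.time()
print('np.dot only:', t1 - t0, 's')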