I've read other posts on how Python speed/performance should be relatively unaffected by whether the code being run is in main, in a function, or defined as a class attribute, but these do not explain the very large performance differences I see when using class vs. local variables, especially with the NumPy library. To be more clear, I made a script example, shown below.
import numpy as np
import copy

class Test:
    def __init__(self, n, m):
        self.X = np.random.rand(n,n,m)
        self.Y = np.random.rand(n,n,m)
        self.Z = np.random.rand(n,n,m)

    def matmul1(self):
        self.A = np.zeros(self.X.shape)
        for i in range(self.X.shape[2]):
            self.A[:,:,i] = self.X[:,:,i] @ self.Y[:,:,i] @ self.Z[:,:,i]
        return

    def matmul2(self):
        self.A = np.zeros(self.X.shape)
        for i in range(self.X.shape[2]):
            x = copy.deepcopy(self.X[:,:,i])
            y = copy.deepcopy(self.Y[:,:,i])
            z = copy.deepcopy(self.Z[:,:,i])
            self.A[:,:,i] = x @ y @ z
        return
t1 = Test(300,100)
%%timeit
t1.matmul1()
#OUTPUT: 20.9 s ± 1.37 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
t1.matmul2()
#OUTPUT: 516 ms ± 6.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In this script I define a class with attributes X, Y and Z, each a 3-way array. I also have two methods (matmul1 and matmul2) which loop over the 3rd index of the arrays and matrix-multiply the corresponding slices of the 3 arrays to populate an array, A. matmul1 loops over the class variables directly and matrix-multiplies, whereas matmul2 creates local copies for each matrix multiplication within the loop. matmul1 is ~40x slower than matmul2. Can someone explain why this is happening? Maybe I am thinking about how to use classes incorrectly, but I also wouldn't assume that variables should be deep-copied all the time. Basically, what is it about deep copying that affects my performance so significantly, and is this unavoidable when using class attributes/variables? It seems like it's more than just the overhead of accessing class attributes as discussed here. Any input is appreciated, thanks!
Edit: My real question is why copies of subarrays of class instance variables, rather than views of them, result in much better performance for these types of methods.
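For what it's worth, a quick check (a minimal sketch, variable names are mine) shows that the slices matmul1 feeds to @ are non-contiguous views into the class attribute, while the copies in matmul2 are fresh contiguous arrays:

```python
import numpy as np

X = np.random.rand(300, 300, 100)

s = X[:, :, 0]      # slice along the last axis, as in matmul1
print(s.base is X)                # True: a view into X, not a new array
print(s.flags['C_CONTIGUOUS'])    # False: its rows are strided in memory

c = s.copy()        # what matmul2 effectively builds
print(c.flags['C_CONTIGUOUS'])    # True: a fresh contiguous buffer
```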
If you put the m dimension first, you could do this product without iteration:
In [146]: X1,Y1,Z1 = X.transpose(2,0,1), Y.transpose(2,0,1), Z.transpose(2,0,1)
In [147]: A1 = X1@Y1@Z1
In [148]: np.allclose(A, A1.transpose(1,2,0))
Out[148]: True
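Wrapped up as a self-contained function (the name matmul_batched is mine), the no-iteration version of that product looks like:

```python
import numpy as np

def matmul_batched(X, Y, Z):
    # Move the m axis to the front so matmul treats the arrays as
    # stacks of (n, n) matrices and broadcasts over the stack,
    # then move the axis back to match the original (n, n, m) layout.
    X1, Y1, Z1 = (a.transpose(2, 0, 1) for a in (X, Y, Z))
    return (X1 @ Y1 @ Z1).transpose(1, 2, 0)
```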
However, working with very large arrays can sometimes be slower, due to memory-management complexities. It might be worth testing
A1[i] = X1[i] @ Y1[i] @ Z1[i]
where the iteration is on the outermost dimension.
My computer is too small to do good timings on these array sizes.
I added these alternatives to your class, and tested with a smaller case:
In [67]: class Test:
    ...:     def __init__(self, n, m):
    ...:         self.X = np.random.rand(n,n,m)
    ...:         self.Y = np.random.rand(n,n,m)
    ...:         self.Z = np.random.rand(n,n,m)
    ...:     def matmul1(self):
    ...:         A = np.zeros(self.X.shape)
    ...:         for i in range(self.X.shape[2]):
    ...:             A[:,:,i] = self.X[:,:,i] @ self.Y[:,:,i] @ self.Z[:,:,i]
    ...:         return A
    ...:     def matmul2(self):
    ...:         A = np.zeros(self.X.shape)
    ...:         for i in range(self.X.shape[2]):
    ...:             x = self.X[:,:,i].copy()
    ...:             y = self.Y[:,:,i].copy()
    ...:             z = self.Z[:,:,i].copy()
    ...:             A[:,:,i] = x @ y @ z
    ...:         return A
    ...:     def matmul3(self):
    ...:         x = self.X.transpose(2,0,1).copy()
    ...:         y = self.Y.transpose(2,0,1).copy()
    ...:         z = self.Z.transpose(2,0,1).copy()
    ...:         return (x@y@z).transpose(1,2,0)
    ...:     def matmul4(self):
    ...:         x = self.X.transpose(2,0,1).copy()
    ...:         y = self.Y.transpose(2,0,1).copy()
    ...:         z = self.Z.transpose(2,0,1).copy()
    ...:         A = np.zeros(x.shape)
    ...:         for i in range(x.shape[0]):
    ...:             A[i] = x[i]@y[i]@z[i]
    ...:         return A.transpose(1,2,0)
In [68]: t1=Test(100,50)
In [69]: np.max(np.abs(t1.matmul2()-t1.matmul4()))
Out[69]: 0.0
In [70]: np.allclose(t1.matmul3(),t1.matmul2())
Out[70]: True
The view iteration is 10x slower:
In [71]: timeit t1.matmul1()
252 ms ± 424 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [72]: timeit t1.matmul2()
26 ms ± 475 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
The two added methods take about the same time as matmul2:
In [73]: timeit t1.matmul3()
30.8 ms ± 4.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [74]: timeit t1.matmul4()
27.3 ms ± 172 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Without the copy(), the transpose produces a view, and times are similar to matmul1 (250 ms).
My guess is that with "fresh" copies, matmul is able to pass them to the best BLAS function by reference. With views, as in matmul1, it has to take some sort of slower route.
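One way to probe that guess without a full deepcopy is np.ascontiguousarray, which copies a view into a fresh C-contiguous buffer (and is a no-op on arrays that are already contiguous). A sketch, with a function name of my own choosing:

```python
import numpy as np

def matmul_contig(X, Y, Z):
    # Same loop as matmul1, but each slice is forced into a fresh
    # C-contiguous buffer before it reaches matmul / BLAS.
    A = np.zeros(X.shape)
    for i in range(X.shape[2]):
        x = np.ascontiguousarray(X[:, :, i])
        y = np.ascontiguousarray(Y[:, :, i])
        z = np.ascontiguousarray(Z[:, :, i])
        A[:, :, i] = x @ y @ z
    return A
```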
But if I use dot instead of matmul, I get the faster time, even with the matmul1 iteration.
In [77]: %%timeit
    ...: A = np.zeros(X.shape)
    ...: for i in range(X.shape[2]):
    ...:     A[:,:,i] = X[:,:,i].dot(Y[:,:,i]).dot(Z[:,:,i])
25.2 ms ± 250 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
It sure looks like matmul with views is taking some suboptimal calculation choice.