Consider the following code in Python, where multiplying a pre-transposed matrix yields a faster execution time than multiplying a non-transposed matrix:
import numpy as np
import time
# Generate random matrix
matrix_size = 1000
matrix = np.random.rand(matrix_size, matrix_size)
# Transpose the matrix
transposed_matrix = np.transpose(matrix)
# Multiply non-transposed matrix
start = time.time()
result1 = np.matmul(matrix, matrix)
end = time.time()
execution_time1 = end - start
# Multiply pre-transposed matrix
start = time.time()
result2 = np.matmul(transposed_matrix, transposed_matrix)
end = time.time()
execution_time2 = end - start
print("Execution time (non-transposed):", execution_time1)
print("Execution time (pre-transposed):", execution_time2)
Surprisingly, multiplying the pre-transposed matrix is faster. One might assume that whether the matrix was transposed beforehand should not significantly affect performance, yet there is a measurable difference.
Why does processing a pre-transposed matrix result in faster execution time compared to a non-transposed matrix? Is there any underlying reason or optimization that explains this behavior?
I've taken the comments about the cache into consideration, and I now generate new matrices on each loop iteration:
import numpy as np
import time
import matplotlib.pyplot as plt
# Generate random matrices
matrix_size = 3000
# Variables to store execution times
execution_times1 = []
execution_times2 = []
# Perform matrix multiplication A @ A and measure execution time over 50 iterations
num_iterations = 50
for _ in range(num_iterations):
    matrix_a = np.random.rand(matrix_size, matrix_size)
    start = time.time()
    result1 = np.matmul(matrix_a, matrix_a)
    end = time.time()
    execution_times1.append(end - start)
# Perform matrix multiplication B @ B.T and measure execution time over 50 iterations
for _ in range(num_iterations):
    matrix_b = np.random.rand(matrix_size, matrix_size)
    start = time.time()
    result2 = np.matmul(matrix_b, matrix_b.T)
    end = time.time()
    execution_times2.append(end - start)
# Print average execution times
avg_execution_time1 = np.mean(execution_times1)
avg_execution_time2 = np.mean(execution_times2)
#print("Average execution time (A @ A):", avg_execution_time1)
#print("Average execution time (B @ B.T):", avg_execution_time2)
# Plot the execution times
plt.plot(range(num_iterations), execution_times1, label='A @ A')
plt.plot(range(num_iterations), execution_times2, label='B @ B.T')
plt.xlabel('Iteration')
plt.ylabel('Execution Time')
plt.title('Matrix Multiplication Execution Time Comparison')
plt.legend()
plt.show()
# Display BLAS configuration
np.show_config()
Results:
blas_mkl_info:
libraries = ['mkl_rt']
library_dirs = ['C:/Users/User/anaconda3\\Library\\lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['C:/Users/User/anaconda3\\Library\\include']
blas_opt_info:
libraries = ['mkl_rt']
library_dirs = ['C:/Users/User/anaconda3\\Library\\lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['C:/Users/User/anaconda3\\Library\\include']
lapack_mkl_info:
libraries = ['mkl_rt']
library_dirs = ['C:/Users/User/anaconda3\\Library\\lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['C:/Users/User/anaconda3\\Library\\include']
lapack_opt_info:
libraries = ['mkl_rt']
library_dirs = ['C:/Users/User/anaconda3\\Library\\lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['C:/Users/User/anaconda3\\Library\\include']
Supported SIMD extensions in this NumPy install:
baseline = SSE,SSE2,SSE3
found = SSSE3,SSE41,POPCNT,SSE42,AVX,F16C,FMA3,AVX2
not found = AVX512F,AVX512CD,AVX512_SKX,AVX512_CLX,AVX512_CNL
The effect doesn't seem all that obvious on my machine.
Over 1000 runs, I get these timings (x = non-transposed, y = transposed). There are more red dots (below the y=x line) than blue dots: 685 vs 315, to be precise. So, p-value wise, there is no doubt that this cannot be just a random effect (1000 coin tosses with 685 heads is a clear anomaly).
But timing-wise, it is not obvious: the cluster is mostly centered on the y=x line.
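(If you want to double-check that sign-test intuition, here is a quick sketch, assuming a reasonably recent scipy is installed; scipy is not used anywhere else in this answer:)
from scipy.stats import binomtest
# 685 "transposed was faster" outcomes out of 1000 runs, against a fair-coin null hypothesis
print(binomtest(685, 1000, 0.5).pvalue)  # vanishingly small, so this is not just noise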
Now, I started this answer because I was pretty sure this was a cache problem. Back in engineering school (a very long time ago, when those considerations were even more important than they are now, and taught by teachers who themselves dated back to a time when they mattered even more), in HPC lessons, we were taught to be very careful when switching from Fortran to C because of cache effects: when iterating over an array, it is very important to iterate in the order it is laid out in memory (which in numpy is still called "C" order vs "Fortran" order, proof that it is still an important consideration for people who care about this more than I do; I rarely need to in my everyday job, hence why I invoke school memories rather than work memories).
The reason is that when you access the number that sits right next to the one you have just processed in memory, that number is probably already in cache. Whereas if the next number you process is one row down (in C order, so much further away in memory), it is more likely not to be in cache. With today's cache sizes, it takes a big matrix for this to make a difference, though.
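If you want to see that memory-order effect in isolation, without any BLAS involved, here is a quick sketch (the exact numbers depend on your CPU and cache sizes, but row-wise traversal of a C-ordered array should come out clearly ahead):
import numpy as np
import timeit

a = np.random.rand(4000, 4000)  # C order: each row is contiguous in memory

# Row-wise traversal: each row is a contiguous block (cache-friendly)
t_rows = timeit.timeit(lambda: sum(row.sum() for row in a), number=10)
# Column-wise traversal: each column is a strided view (cache-unfriendly)
t_cols = timeit.timeit(lambda: sum(col.sum() for col in a.T), number=10)

print("row-wise   :", t_rows)
print("column-wise:", t_cols)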
Since transpose doesn't move any data and just adjusts the strides, the effect of working on a transposed matrix is that you change the memory order in which the data are processed. So, consider the naive algorithm:
for i in range(N):
    for j in range(N):
        res[i, j] = 0
        for k in range(N):
            res[i, j] += A[i, k] * B[k, j]
If A and B are in C order, then matrix A is iterated in memory order (we go along a row, column by column, so adjacent numbers in memory one after the other), while B is not.
If that order is inverted, for example because the matrices have been transposed, then it is the reverse: it is B that is iterated in the cache-friendly order and A that is not.
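You can check that the transpose really is only a metadata change with a quick sketch like this:
import numpy as np
A = np.random.rand(1000, 1000)
print(A.strides, A.T.strides)    # (8000, 8) vs (8, 8000) for float64: only the strides differ
print(np.shares_memory(A, A.T))  # True: the transpose is a view on the very same buffer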
Well, no need to dwell on this too long; I tell all this to explain why I wanted to investigate the possibility of a cache problem (my intent was to compare the same multiplication against a copy of the transposed matrix, so that it is the same matrix multiplication with only the memory order changing, and also to see whether there is a threshold in matrix size below which the phenomenon disappears, which would also point to a cache problem, since, for it to matter, the whole matrix must not fit in cache).
But the first step in doing so is to avoid bias, because the first computation uses data not yet in cache, while the second uses data that is already there (especially when the whole matrix fits in cache).
So, here is the first thing I tried: simply inverting the computation order, computing first on transposed_matrix and then on matrix.
This time, the shift is in favor of the blue dots (and, of course, I changed only the computation order, not the meaning of the axes: x is still the matrix@matrix timing, and y is still the transposed_matrix@transposed_matrix timing).
The number of red dots this time is 318 vs 682, so almost exactly the opposite of before.
So, the conclusion (valid at least for my machine): this is indeed a cache problem. But a cache problem caused only by a bias in favor of transposed_matrix: its data are already in cache (since they are the very same data as matrix's) by the time it is used in the computation.
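Here is a compact sketch of that order experiment (a sketch only: the win counts will vary with machine and load, but the product that runs second should win clearly more often than half the time):
import numpy as np
import time

def time_once(f):
    start = time.time()
    f()
    return time.time() - start

second_faster = 0
runs = 100
for _ in range(runs):
    M = np.random.rand(1000, 1000)            # fresh matrix each run, as in the plots above
    t_first = time_once(lambda: M @ M)        # operates on "cold" data
    t_second = time_once(lambda: M.T @ M.T)   # operates on data the first product just touched
    second_faster += t_second < t_first
print("second product was faster in", second_faster, "out of", runs, "runs")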
As I said in a comment (but since the question was updated, and this answer is already quite upvoted, I think it is important that it also appears here for future readers), the update is something different.
Your first question was about A@A vs A.T@A.T. The second appeared to be faster, but only over a single operation. So the reason, as I've shown, was simply that by the time the second operation runs, A is already in cache memory (which was not the case when the first operation ran), because A.T is the same data as A (not a copy, but the same data at the same memory address).
My previous answer shows that if you reverse the order and compute A.T@A.T first, then A@A, then it is, on the contrary, A.T@A.T that is slower, and in the exact same proportion.
Another way to show it:
import numpy as np
import timeit
A=np.random.normal(0,1,(1000,1000))
B=A.copy()
A@A
print(timeit.timeit(lambda: A@A, number=20))
A.T@A.T
print(timeit.timeit(lambda: A.T@A.T, number=20))
B@B
print(timeit.timeit(lambda: B@B, number=20))
(Performing A@A once before each timeit call is just to ensure that the first of the 20 timed computations is not slower because of cache effects.)
On my computer, all of those operations take almost exactly 1 second (number=20 was chosen so that they do).
This time there is no cache effect, because we run each product 21 times and don't count the first run. And there is no influence of .T.
Now, for your question update, that is something else:
A@A.T
print(timeit.timeit(lambda: A@A.T, number=20))
A.T@A
print(timeit.timeit(lambda: A.T@A, number=20))
A@B.T
print(timeit.timeit(lambda: A@B.T, number=20))
This time, the first two operations take only 650 ms each. And not because of cache: the timing is the same whatever the order of those operations.
That is because numpy is able to detect that A.T and A are the same matrix, just seen through a transposition.
(It is quite easy to detect: it is the same data address, but the strides and shape are swapped (well, here the shape is square anyway; what matters is that the strides are swapped): A.strides → (8000, 8), A.T.strides → (8, 8000).)
So, it is easy for numpy to realize that this is an A@A.T situation, and therefore to apply an algorithm that computes it faster. As said in the comments (by others before me, including some who, days ago, misread your first question... but were right to do so, since they answered in advance what is now the update): A@A.T is symmetric, so there are some easy savings to be had here.
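(The symmetry itself is easy to verify with a quick sketch:)
import numpy as np
A = np.random.rand(500, 500)
S = A @ A.T
print(np.allclose(S, S.T))  # True: S[i, j] == S[j, i], so only about half of the entries really need computing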
Note that
timeit.timeit(lambda: A@B, number=20)
timeit.timeit(lambda: A@B.T, number=20)
are both 1 second (as A@A and A.T@A.T are). So, it is easy to understand that A@B, A@A, and A@B.T all just use the one standard "matrix multiplication" algorithm, while A@A.T and A.T@A use a faster one.
Since B is a copy of A, A@B.T has the same symmetric result as A@A.T. But this time, because it is a copy, numpy cannot realize that it is an A@A.T situation, and so cannot know that the result is symmetric before it is computed. So A@B.T has the same standard "1 second" timing as A@A, while A@A.T has not.
Which confirms that the detection relies on the "same address, swapped strides" criterion. As long as either the address is not the same or the strides are not swapped, you get the standard algorithm and 1 second. If it is both the same address and swapped strides, you get the special algorithm and 650 ms.
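To check that criterion on your own machine, here is a sketch that puts all the cases side by side (the absolute timings will differ, but the standard-vs-fast split should show up):
import numpy as np
import timeit

A = np.random.normal(0, 1, (1000, 1000))
B = A.copy()  # same values, but a different memory address

cases = {
    "A@A    ": lambda: A @ A,        # standard path
    "A.T@A.T": lambda: A.T @ A.T,    # standard path: same address, but strides not swapped relative to each other
    "A@B.T  ": lambda: A @ B.T,      # standard path: different addresses
    "A@A.T  ": lambda: A @ A.T,      # fast path: same address, swapped strides
    "A.T@A  ": lambda: A.T @ A,      # fast path: same address, swapped strides
}
for name, f in cases.items():
    f()  # warm-up so the first timed call is not penalized by a cold cache
    print(name, timeit.timeit(f, number=20))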