Consider the following code in Python, where multiplying a pre-transposed matrix yields a faster execution time than multiplying a non-transposed matrix:
import numpy as np
import time
# Generate random matrix
matrix_size = 1000
matrix = np.random.rand(matrix_size, matrix_size)
# Transpose the matrix
transposed_matrix = np.transpose(matrix)
# Multiply non-transposed matrix
start = time.time()
result1 = np.matmul(matrix, matrix)
end = time.time()
execution_time1 = end - start
# Multiply pre-transposed matrix
start = time.time()
result2 = np.matmul(transposed_matrix, transposed_matrix)
end = time.time()
execution_time2 = end - start
print("Execution time (non-transposed):", execution_time1)
print("Execution time (pre-transposed):", execution_time2)
Surprisingly, multiplying the pre-transposed matrix is faster. One might assume that whether the matrix was transposed beforehand should not significantly affect performance, yet there is a measurable difference.
Why does processing a pre-transposed matrix result in faster execution time compared to a non-transposed matrix? Is there any underlying reason or optimization that explains this behavior?
I've taken the comments about the cache into consideration, and I now generate new matrices on each loop iteration:
import numpy as np
import time
import matplotlib.pyplot as plt
# Generate random matrices
matrix_size = 3000
# Variables to store execution times
execution_times1 = []
execution_times2 = []
# Perform matrix multiplication A @ A and measure execution time over 50 iterations
num_iterations = 50
for _ in range(num_iterations):
    matrix_a = np.random.rand(matrix_size, matrix_size)
    start = time.time()
    result1 = np.matmul(matrix_a, matrix_a)
    end = time.time()
    execution_times1.append(end - start)
# Perform matrix multiplication B @ B.T and measure execution time over 50 iterations
for _ in range(num_iterations):
    matrix_b = np.random.rand(matrix_size, matrix_size)
    start = time.time()
    result2 = np.matmul(matrix_b, matrix_b.T)
    end = time.time()
    execution_times2.append(end - start)
# Print average execution times
avg_execution_time1 = np.mean(execution_times1)
avg_execution_time2 = np.mean(execution_times2)
#print("Average execution time (A @ A):", avg_execution_time1)
#print("Average execution time (B @ B.T):", avg_execution_time2)
# Plot the execution times
plt.plot(range(num_iterations), execution_times1, label='A @ A')
plt.plot(range(num_iterations), execution_times2, label='B @ B.T')
plt.xlabel('Iteration')
plt.ylabel('Execution Time')
plt.title('Matrix Multiplication Execution Time Comparison')
plt.legend()
plt.show()
# Display BLAS configuration
np.show_config()
Results:
blas_mkl_info:
libraries = ['mkl_rt']
library_dirs = ['C:/Users/User/anaconda3\\Library\\lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['C:/Users/User/anaconda3\\Library\\include']
blas_opt_info:
libraries = ['mkl_rt']
library_dirs = ['C:/Users/User/anaconda3\\Library\\lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['C:/Users/User/anaconda3\\Library\\include']
lapack_mkl_info:
libraries = ['mkl_rt']
library_dirs = ['C:/Users/User/anaconda3\\Library\\lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['C:/Users/User/anaconda3\\Library\\include']
lapack_opt_info:
libraries = ['mkl_rt']
library_dirs = ['C:/Users/User/anaconda3\\Library\\lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['C:/Users/User/anaconda3\\Library\\include']
Supported SIMD extensions in this NumPy install:
baseline = SSE,SSE2,SSE3
found = SSSE3,SSE41,POPCNT,SSE42,AVX,F16C,FMA3,AVX2
not found = AVX512F,AVX512CD,AVX512_SKX,AVX512_CLX,AVX512_CNL
The effect doesn't seem all that obvious on my machine.
Over 1000 runs, I get these timings (x = non-transposed, y = transposed). There are more red dots (below the y=x line) than blue dots: 685 vs 315, to be precise. So, p-value wise, there is no doubt that this cannot be just a random effect (1000 coin tosses with 685 heads is a clear anomaly).
But timing-wise, it is not obvious: the cluster is mostly centered on the y=x line.
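(If you want to double-check that sign-test intuition, here is a quick sketch, assuming a reasonably recent scipy is installed; scipy is not used anywhere else in this answer:)
from scipy.stats import binomtest
# 685 "transposed was faster" outcomes out of 1000 runs, against a fair-coin null hypothesis
print(binomtest(685, 1000, 0.5).pvalue)  # vanishingly small, so this is not just noise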
Now, I started this answer because I was pretty sure this was a cache problem. Back in engineering school (a very long time ago, when those considerations were even more important than they are now, and taught by teachers who themselves dated back to a time when they mattered even more), in HPC lessons, we were taught to be very careful when switching from Fortran to C because of cache effects: when iterating over an array, it is very important to iterate in the order it is laid out in memory (which in numpy is still called "C" order vs "Fortran" order, proof that it is still an important consideration for people who care about this more than I do; I rarely need to in my everyday job, hence why I invoke school memories rather than work memories).
The reason is that when you access the number that sits right next to the one you have just processed in memory, that number is probably already in cache. Whereas if the next number you process is one row down (in C order, so much further away in memory), it is more likely not to be in cache. With today's cache sizes, it takes a big matrix for this to make a difference, though.
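If you want to see that memory-order effect in isolation, without any BLAS involved, here is a quick sketch (the exact numbers depend on your CPU and cache sizes, but row-wise traversal of a C-ordered array should come out clearly ahead):
import numpy as np
import timeit

a = np.random.rand(4000, 4000)  # C order: each row is contiguous in memory

# Row-wise traversal: each row is a contiguous block (cache-friendly)
t_rows = timeit.timeit(lambda: sum(row.sum() for row in a), number=10)
# Column-wise traversal: each column is a strided view (cache-unfriendly)
t_cols = timeit.timeit(lambda: sum(col.sum() for col in a.T), number=10)

print("row-wise   :", t_rows)
print("column-wise:", t_cols)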
Since transpose doesn't move any data and just adjusts the strides, the effect of working on a transposed matrix is that you change the memory order in which the data are processed. So, consider the naive algorithm:
for i in range(N):
    for j in range(N):
        res[i, j] = 0
        for k in range(N):
            res[i, j] += A[i, k] * B[k, j]
If A and B are in C order, then matrix A is iterated in memory order (we go along a row, column by column, so adjacent numbers in memory one after the other), while B is not.
If that order is inverted, for example because the matrices have been transposed, then it is the reverse: it is B that is iterated in the cache-friendly order and A that is not.
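You can check that the transpose really is only a metadata change with a quick sketch like this:
import numpy as np
A = np.random.rand(1000, 1000)
print(A.strides, A.T.strides)    # (8000, 8) vs (8, 8000) for float64: only the strides differ
print(np.shares_memory(A, A.T))  # True: the transpose is a view on the very same buffer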
Well, no need to dwell on this too long; I tell all this to explain why I wanted to investigate the possibility of a cache problem (my intent was to compare the same multiplication against a copy of the transposed matrix, so that it is the same matrix multiplication with only the memory order changing, and also to see whether there is a threshold in matrix size below which the phenomenon disappears, which would also point to a cache problem, since, for it to matter, the whole matrix must not fit in cache).
But the first step in doing so is to avoid bias, because the first computation uses data not yet in cache, while the second uses data that is already there (especially when the whole matrix fits in cache).
So, here is the first thing I tried: simply inverting the computation order, computing first on transposed_matrix and then on matrix.
This time, the shift is in favor of the blue dots (and, of course, I changed only the computation order, not the meaning of the axes: x is still the matrix@matrix timing, and y is still the transposed_matrix@transposed_matrix timing).
The number of red dots this time is 318 vs 682, so almost exactly the opposite of before.
So, the conclusion (valid at least for my machine): this is indeed a cache problem. But a cache problem caused only by a bias in favor of transposed_matrix: its data are already in cache (since they are the very same data as matrix's) by the time it is used in the computation.
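Here is a compact sketch of that order experiment (a sketch only: the win counts will vary with machine and load, but the product that runs second should win clearly more often than half the time):
import numpy as np
import time

def time_once(f):
    start = time.time()
    f()
    return time.time() - start

second_faster = 0
runs = 100
for _ in range(runs):
    M = np.random.rand(1000, 1000)            # fresh matrix each run, as in the plots above
    t_first = time_once(lambda: M @ M)        # operates on "cold" data
    t_second = time_once(lambda: M.T @ M.T)   # operates on data the first product just touched
    second_faster += t_second < t_first
print("second product was faster in", second_faster, "out of", runs, "runs")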
As I said in a comment (but since the question was updated, and this answer is already quite upvoted, I think it is important that it also appears here for future readers), the update is something different.
Your first question was about A@A vs A.T@A.T. The second appeared to be faster, but only over a single operation. So the reason, as I've shown, was simply that by the time the second operation runs, A is already in cache memory (which was not the case when the first operation ran), because A.T is the same data as A (not a copy, but the same data at the same memory address).
My previous answer shows that if you reverse the order and compute A.T@A.T first, then A@A, then it is, on the contrary, A.T@A.T that is slower, and in the exact same proportion.
Another way to show it:
import numpy as np
import timeit
A=np.random.normal(0,1,(1000,1000))
B=A.copy()
A@A
print(timeit.timeit(lambda: A@A, number=20))
A.T@A.T
print(timeit.timeit(lambda: A.T@A.T, number=20))
B@B
print(timeit.timeit(lambda: B@B, number=20))
(Performing A@A once before each timeit call is just to ensure that the first of the 20 timed computations is not slower because of cache effects.)
On my computer, all of those operations take almost exactly 1 second (number=20 was chosen so that they do).
This time there is no cache effect, because we run each product 21 times and don't count the first run. And there is no influence of .T.
Now, for your question update, that is something else:
A@A.T
print(timeit.timeit(lambda: A@A.T, number=20))
A.T@A
print(timeit.timeit(lambda: A.T@A, number=20))
A@B.T
print(timeit.timeit(lambda: A@B.T, number=20))
This time, the first two operations take only 650 ms each. And not because of cache: the timing is the same whatever the order of those operations.
That is because numpy is able to detect that A.T and A are the same matrix, just seen through a transposition.
(It is quite easy to detect: it is the same data address, but the strides and shape are swapped (well, here the shape is square anyway; what matters is that the strides are swapped): A.strides → (8000, 8), A.T.strides → (8, 8000).)
So, it is easy for numpy to realize that this is an A@A.T situation, and therefore to apply an algorithm that computes it faster. As said in the comments (by others before me, including some who, days ago, misread your first question... but were right to do so, since they answered in advance what is now the update): A@A.T is symmetric, so there are some easy savings to be had here.
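(The symmetry itself is easy to verify with a quick sketch:)
import numpy as np
A = np.random.rand(500, 500)
S = A @ A.T
print(np.allclose(S, S.T))  # True: S[i, j] == S[j, i], so only about half of the entries really need computing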
Note that
timeit.timeit(lambda: A@B, number=20)
timeit.timeit(lambda: A@B.T, number=20)
are both 1 second (as A@A and A.T@A.T are). So, it is easy to understand that A@B, A@A, and A@B.T all just use the one standard "matrix multiplication" algorithm, while A@A.T and A.T@A use a faster one.
Since B is a copy of A, A@B.T has the same symmetric result as A@A.T. But this time, because it is a copy, numpy cannot realize that it is an A@A.T situation, and so cannot know that the result is symmetric before it is computed. So A@B.T has the same standard "1 second" timing as A@A, while A@A.T has not.
Which confirms that the detection relies on the "same address, swapped strides" criterion. As long as either the address is not the same or the strides are not swapped, you get the standard algorithm and 1 second. If it is both the same address and swapped strides, you get the special algorithm and 650 ms.
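To check that criterion on your own machine, here is a sketch that puts all the cases side by side (the absolute timings will differ, but the standard-vs-fast split should show up):
import numpy as np
import timeit

A = np.random.normal(0, 1, (1000, 1000))
B = A.copy()  # same values, but a different memory address

cases = {
    "A@A    ": lambda: A @ A,        # standard path
    "A.T@A.T": lambda: A.T @ A.T,    # standard path: same address, but strides not swapped relative to each other
    "A@B.T  ": lambda: A @ B.T,      # standard path: different addresses
    "A@A.T  ": lambda: A @ A.T,      # fast path: same address, swapped strides
    "A.T@A  ": lambda: A.T @ A,      # fast path: same address, swapped strides
}
for name, f in cases.items():
    f()  # warm-up so the first timed call is not penalized by a cold cache
    print(name, timeit.timeit(f, number=20))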