I was intrigued by the discussion at http://scipy.github.io/old-wiki/pages/PerformanceTips on how to get faster dot computations.
It concludes that dotting C-contiguous matrices should be faster, and presents the following results:
import numpy as np
from time import time

N = 1000000
n = 40
A = np.ones((N, n))                 # C-contiguous by default
AT_F = np.ones((n, N), order='F')   # same shape as A.T, Fortran-ordered
AT_C = np.ones((n, N), order='C')   # same shape as A.T, C-ordered
>>> t = time();C = np.dot(A.T, A);t1 = time() - t
3.9203271865844727
>>> t = time();C = np.dot(AT_F, A);t2 = time() - t
3.9461679458618164
>>> t = time();C = np.dot(AT_C, A);t3 = time() - t
2.4167969226837158
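As a sanity check on what each variant actually is memory-wise (my own quick check with the same arrays as above): A.T is only a view with swapped strides, laid out like AT_F, while AT_C holds a fresh C-ordered array of the same shape.
>>> A.flags['C_CONTIGUOUS'], A.T.flags['F_CONTIGUOUS']
(True, True)
>>> AT_F.flags['F_CONTIGUOUS'], AT_C.flags['C_CONTIGUOUS']
(True, True)
>>> np.shares_memory(A, A.T)   # transposing makes no copy
True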
I tried it as well (Python 3.7), and the final computation, using C-contiguous matrices, is not faster at all!
I get the following results:
>>> t1
0.2102820873260498
>>> t2
0.4134488105773926
>>> t3
0.28309035301208496
It turns out the first approach is the fastest.
Where does the discrepancy between their results and mine come from? And how can transposing in the first case not slow the calculation down?
Thanks
My Linux/timeit times:
In [122]: timeit A.T@A
258 ms ± 523 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [123]: timeit AT_F@A
402 ms ± 2.66 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [124]: timeit AT_C@A
392 ms ± 9.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [125]: %%timeit x=A.T.copy(order='F')
...: x@A
410 ms ± 18.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
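A plausible reason the plain A.T isn't penalised: the matrix product goes through a BLAS gemm routine, which can be told to treat an operand as transposed rather than needing a physically transposed copy. scipy.linalg.blas exposes that flag, so you can check the equivalence yourself (a sketch; whether your NumPy build takes exactly this path depends on the BLAS it links against):

import numpy as np
from scipy.linalg import blas

N, n = 1000000, 40
A = np.ones((N, n))

# trans_a=1 asks dgemm to compute A.T @ A without ever materialising A.T
C_blas = blas.dgemm(1.0, A, A, trans_a=1)
print(np.allclose(C_blas, A.T @ A))   # True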