I have a huge 3D array with A.shape = (100000, 5000, 50). I need to transpose it to get an array of shape (50, 5000, 100000), and then compute a.T @ a for each of the 50 matrices a = A[:, :, i] (each of shape (100000, 5000)) contained in A. This gives a 3D array of shape (50, 5000, 5000).
If I do this with A.transpose(2, 1, 0) @ A.transpose(2, 0, 1), the individual matrix multiplications a.T @ a turn out to be about a thousand times slower than when a is a standalone array rather than a slice extracted from A.
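For concreteness, here is a reduced-size sketch of the computation (the sizes below are only stand-ins for the real ones so the example runs quickly):

import numpy as np

# Scaled-down stand-in for the real (100000, 5000, 50) array.
A = np.random.rand(1000, 200, 50)

# Batched version: B[i] = A[:, :, i].T @ A[:, :, i] for each of the 50 matrices.
B = A.transpose(2, 1, 0) @ A.transpose(2, 0, 1)
print(B.shape)  # (50, 200, 200)

# The same product on a standalone contiguous copy, for comparison;
# the operands above are strided (non-contiguous) views of A.
a = A[:, :, 0].copy()
print(np.allclose(B[0], a.T @ a))  # True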
The problem is that after transposing, the 3D array is no longer contiguous. I tried using np.ascontiguousarray() or copy() after transposing; this helps, but it is still slower and a fair amount of time is spent on the copy itself.
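The copy-based workaround looks roughly like this (same reduced sizes as above; the explicit per-slice copies are where the extra time goes):

import numpy as np

A = np.random.rand(1000, 200, 50)  # reduced stand-in for the real array

# Make each slice contiguous before the BLAS call: the multiplication itself
# gets faster, but a full copy of A is paid for along the way.
B = np.empty((A.shape[2], A.shape[1], A.shape[1]))
for i in range(A.shape[2]):
    a = np.ascontiguousarray(A[:, :, i])  # copy; A[:, :, i] is a strided view
    B[i] = a.T @ a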
Could anyone suggest a better approach? In particular, I have been trying to use np.einsum but could not get it to work.
You can try the following:
import numpy as np

A = ...  # your array of shape (100000, 5000, 50)
b = np.einsum('jki,jli->ikl', A, A)  # b[i] = A[:, :, i].T @ A[:, :, i]
print(b.shape)
# (50, 5000, 5000)
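Here 'jki,jli->ikl' sums over the shared first axis j, so b[i] equals A[:, :, i].T @ A[:, :, i] without building a transposed or contiguous copy of A. A quick sanity check on a reduced-size array (sizes chosen only so it runs quickly):

import numpy as np

A = np.random.rand(1000, 200, 50)  # reduced stand-in for the real array

b = np.einsum('jki,jli->ikl', A, A)
B = A.transpose(2, 1, 0) @ A.transpose(2, 0, 1)
print(b.shape)            # (50, 200, 200)
print(np.allclose(b, B))  # True

Depending on your NumPy/BLAS setup it can also be worth passing optimize=True to np.einsum, though whether it helps for this particular contraction varies.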