Tags: performance, numpy, matrix-multiplication

Multiplying matrices across a tensor axis with NumPy and with GPU


I have an array X with shape (F,T,M). I wish to multiply each (T,M) slice along the F axis by its own transpose, so that the answer has shape (M,M,F). This code does the job, but the operation repeats many times and is very slow:

    for f in range(F):
        output[:,:,f] = np.matmul(X[f,:,:].T,X[f,:,:])

All I could find is the np.tensordot() function. If I understand correctly, it is not a good option for me, since I need a batch of independent matrix multiplications, not a single tensor contraction.
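To illustrate the mismatch, a small sketch (the sizes F=3, T=4, M=5 are arbitrary): tensordot contracts the T axes across *all* pairs of F slices, so the per-f products end up on the diagonal of a much larger array.

```python
import numpy as np

F, T, M = 3, 4, 5  # arbitrary illustrative sizes
X = np.random.rand(F, T, M)

# Contracting axis 1 (the T axis) of both operands pairs every f with
# every other f, giving shape (F, M, F, M) instead of one (M, M) per f:
out = np.tensordot(X, X, axes=([1], [1]))
print(out.shape)  # (3, 5, 3, 5)

# Only the "diagonal" blocks out[f, :, f, :] equal X[f].T @ X[f],
# so F*F blocks are computed where only F are needed.
```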

How do I implement this efficiently using NumPy? Would it be possible and beneficial to use Keras/TensorFlow for this purpose?


Solution

  • We can use np.matmul or the @ operator (Python 3.5+) after swapping the last two axes -

    np.matmul(X.swapaxes(1,2),X).swapaxes(0,2)
    (X.swapaxes(1,2)@X).swapaxes(0,2)
    

    Alternatively, np.einsum gives a direct translation of the shape variables into the subscript string -

    np.einsum('ftm,ftn->mnf',X,X)
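
    A quick sanity check that both one-liners match the original loop (array sizes here are arbitrary):

    ```python
    import numpy as np

    F, T, M = 3, 4, 5  # arbitrary illustrative sizes
    X = np.random.rand(F, T, M)

    # Reference: the explicit loop from the question
    ref = np.empty((M, M, F))
    for f in range(F):
        ref[:, :, f] = np.matmul(X[f, :, :].T, X[f, :, :])

    # swapaxes(1,2) gives (F, M, T); matmul with (F, T, M) batches over F,
    # producing (F, M, M); the final swapaxes(0,2) reorders to (M, M, F)
    out1 = np.matmul(X.swapaxes(1, 2), X).swapaxes(0, 2)

    # einsum: sum over t for each f, placing f last in the output
    out2 = np.einsum('ftm,ftn->mnf', X, X)

    print(np.allclose(ref, out1), np.allclose(ref, out2))  # True True
    ```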