
Fastest Way to Find the Dot Product of a Large Matrix of Vectors


I am looking for suggestions on the most efficient way to solve the following problem:

I have two arrays called A and B. They are both of shape NxNx3. They represent two 2D matrices of positions, where each position is a vector of x, y, and z coordinates.

I want to create a new array, called C, of shape NxN, where C[i, j] is the dot product of the vectors A[i, j] and B[i, j].

Here are the solutions I've come up with so far. The first uses numpy's einsum function (which is beautifully described here). The second uses numpy's broadcasting rules along with its sum function.

>>> import numpy as np
>>> A = np.random.randint(0, 10, (100, 100, 3))
>>> B = np.random.randint(0, 10, (100, 100, 3))
>>> C = np.einsum("ijk,ijk->ij", A, B)
>>> D = np.sum(A * B, axis=2)
>>> np.allclose(C, D)
True
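
For reference, here's a rough timeit harness I've been using to compare the two approaches (my own sketch; the array shapes just mirror the example above, and absolute timings will vary by machine):

```python
import timeit
import numpy as np

# Same setup as the example session above
A = np.random.randint(0, 10, (100, 100, 3))
B = np.random.randint(0, 10, (100, 100, 3))

for label, fn in [
    ("einsum", lambda: np.einsum("ijk,ijk->ij", A, B)),
    ("sum",    lambda: np.sum(A * B, axis=2)),
]:
    # Average over 1000 calls; report microseconds per call
    t = timeit.timeit(fn, number=1000)
    print(f"{label}: {t / 1000 * 1e6:.2f} µs per call")
```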

Is there a faster way? I've heard murmurs that numpy's tensordot function can be blazing fast, but I've always struggled to understand it. What about numpy's dot or inner functions?

For some context, the A and B arrays will typically have between 100 and 1000 elements.

Any guidance is much appreciated!


Solution

  • With a bit of reshaping, we can use matmul. The idea is to treat the first 2 dimensions as 'batch' dimensions, and to do the dot on the last:

    In [278]: E = A[...,None,:]@B[...,:,None]                                       
    In [279]: E.shape                                                               
    Out[279]: (100, 100, 1, 1)
    In [280]: E = np.squeeze(A[...,None,:]@B[...,:,None])                           
    In [281]: np.allclose(C,E)                                                      
    Out[281]: True
    In [282]: timeit E = np.squeeze(A[...,None,:]@B[...,:,None])                    
    130 µs ± 2.01 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    In [283]: timeit C = np.einsum("ijk,ijk->ij", A, B)                             
    90.2 µs ± 1.53 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    

    Comparative timings can be a bit tricky. In current NumPy versions, einsum can take different routes depending on the dimensions; in some cases it appears to delegate the task to matmul (or at least to the same underlying BLAS-like code). While it's nice that einsum is faster in this test, I wouldn't generalize that.
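
    One way to peek at the route einsum chooses is np.einsum_path, which reports the contraction plan and an estimated cost (a quick sketch; the exact report text varies by NumPy version):

    ```python
    import numpy as np

    A = np.random.randint(0, 10, (100, 100, 3))
    B = np.random.randint(0, 10, (100, 100, 3))

    # einsum_path returns the chosen contraction order plus a human-readable summary
    path, info = np.einsum_path("ijk,ijk->ij", A, B, optimize="optimal")
    print(path)   # contraction order, e.g. ['einsum_path', (0, 1)]
    print(info)   # describes the route and estimated FLOP cost
    ```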

    tensordot just reshapes (and, if needed, transposes) the arrays so it can apply the ordinary 2D np.dot. It doesn't work here because you are treating the first 2 axes as a 'batch', whereas tensordot does an outer product on them.
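
    To make that concrete, here's a small sketch (using smaller arrays for speed) of what tensordot computes on these inputs: contracting only the last axes leaves an outer product over the first two, so the per-position dot products end up on the [i, j, i, j] 'diagonal' of a much larger result:

    ```python
    import numpy as np

    A = np.random.randint(0, 10, (10, 10, 3))
    B = np.random.randint(0, 10, (10, 10, 3))

    # Contract only the last axes; the remaining axes combine as an outer product
    T = np.tensordot(A, B, axes=([2], [2]))
    print(T.shape)  # (10, 10, 10, 10), not the desired (10, 10)

    # The wanted NxN result is the [i, j, i, j] 'diagonal' of that big array
    C = np.einsum("ijk,ijk->ij", A, B)
    i, j = np.indices((10, 10))
    assert np.allclose(T[i, j, i, j], C)
    ```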