I have some code that uses the following einsum:
y = np.einsum('wxyijk,ijkd->wxyd', x, f)
where (for example) x has shape (64, 26, 26, 3, 3, 3) and f has shape (3, 3, 3, 1), both with dtype=float.
%timeit np.einsum('wxyijk,ijkd->wxyd', x, f)
# 2.01 ms ± 55.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This is too slow for my application, which is time critical. Neither using the GPU (via CuPy) nor path speedups (via opt-einsum) seems to make this any faster. Is there any way to make it faster natively in NumPy, or is this just about as fast as it's going to get?
In this case you could use the optimize keyword, implement the contraction yourself, or use tensordot. All but the first version should actually do the same thing (reshape -> dot -> reshape).
Your implementation
import numpy as np

x = np.random.rand(64, 26, 26, 3, 3, 3)
f = np.random.rand(3, 3, 3, 1)
%timeit y = np.einsum('wxyijk,ijkd->wxyd', x, f)
#886 µs ± 3.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
With optimize="optimal"
%timeit y = np.einsum('wxyijk,ijkd->wxyd', x, f,optimize="optimal")
#275 µs ± 23.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
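To see which contraction order the optimize keyword actually chooses, np.einsum_path can be inspected directly. A quick sketch with the same shapes as above:

```python
import numpy as np

x = np.random.rand(64, 26, 26, 3, 3, 3)
f = np.random.rand(3, 3, 3, 1)

# einsum_path returns the chosen contraction path plus a readable summary
path, info = np.einsum_path('wxyijk,ijkd->wxyd', x, f, optimize='optimal')
print(path)  # e.g. ['einsum_path', (0, 1)]
print(info)  # FLOP count, speedup estimate, intermediate sizes
```

With only two operands there is a single contraction step, so the win here comes from einsum dispatching to a BLAS-backed routine rather than from path reordering.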
Reshaping and a BLAS call
This normally yields performance quite comparable to optimize="optimal"; perhaps in this case an unnecessary array copy causes the slowdown.
def contract(x, f):
    s1 = x.shape
    s2 = f.shape
    # flatten the leading axes of x and the contracted axes of both arrays
    x_ = x.reshape(s1[0]*s1[1]*s1[2], s1[3]*s1[4]*s1[5])
    f_ = f.reshape(s2[0]*s2[1]*s2[2], s2[3])
    # one BLAS matrix multiply, then restore the leading axes
    return np.dot(x_, f_).reshape(s1[0], s1[1], s1[2], s2[3])
%timeit contract(x,f)
#144 µs ± 3.09 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
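As a sanity check, the hand-rolled version can be compared against the original einsum. A quick sketch, regenerating the inputs so it stands alone:

```python
import numpy as np

def contract(x, f):
    s1 = x.shape
    s2 = f.shape
    x_ = x.reshape(s1[0]*s1[1]*s1[2], s1[3]*s1[4]*s1[5])
    f_ = f.reshape(s2[0]*s2[1]*s2[2], s2[3])
    return np.dot(x_, f_).reshape(s1[0], s1[1], s1[2], s2[3])

x = np.random.rand(64, 26, 26, 3, 3, 3)
f = np.random.rand(3, 3, 3, 1)

# both routes compute the same contraction, up to floating-point rounding
assert np.allclose(contract(x, f), np.einsum('wxyijk,ijkd->wxyd', x, f))
```

The reshapes are free whenever x and f are C-contiguous (a view, no copy); if x arrives non-contiguous, reshape has to copy, which would eat into the speedup.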
Tensordot
%timeit np.tensordot(x,f,axes=3)
#176 µs ± 4.32 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
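For readers unfamiliar with the axes argument: an integer axes=3 means "contract the last three axes of the first array against the first three axes of the second". A sketch showing the equivalent explicit pair form:

```python
import numpy as np

x = np.random.rand(64, 26, 26, 3, 3, 3)
f = np.random.rand(3, 3, 3, 1)

# shorthand form: sum over the trailing 3 axes of x and leading 3 of f
y1 = np.tensordot(x, f, axes=3)
# explicit form: name the contracted axis pairs directly
y2 = np.tensordot(x, f, axes=([3, 4, 5], [0, 1, 2]))

assert np.allclose(y1, y2)
print(y1.shape)  # (64, 26, 26, 1)
```

Internally tensordot performs the same reshape -> dot -> reshape as the manual version, which is why the two timings are so close.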