Tags: python, numpy, numba

How to perform fast tensor contraction with two tensors and a vector?


I'm using NumPy (and ideally Numba) to perform a tensor contraction involving three tensors, one of which is a vector that should multiply only a single index of the other two. For example,

import numpy as np

A = np.random.normal(size=(20, 20, 20, 20))
B = np.random.normal(size=(20, 20, 20, 20))
v = np.sqrt(np.arange(20))

# e.g. v on the 3rd index (axis 2)
%timeit np.vdot(A * v[None, None, :, None], B)
125 µs ± 5.14 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

compare with

C = np.random.normal(size=(20,20,20,20))

%timeit np.vdot(A * C, B)
76.8 µs ± 4.25 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

Is there a more efficient way of including the product with v? It feels wrong that it should be slower than multiplying by the full tensor C.
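
For reference, the quantity being computed is the scalar sum over all four indices of A[i,j,k,l] * v[k] * B[i,j,k,l]. It can be stated in a single einsum call; whether this beats the explicit broadcast depends on the NumPy build, so treat it as a sketch rather than a guaranteed speed-up:

# Same scalar as np.vdot(A * v[None, None, :, None], B); optimize=True
# lets NumPy choose an evaluation order for the three-operand contraction.
np.einsum('ijkl,k,ijkl->', A, v, B, optimize=True)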


Solution

  • I could squeeze out some performance by using Numba with parallel=True; a fastmath variant is sketched after the timings below.

    import numba as nb
    import numpy as np
    
    N = 50
    
    # Weight the k-axis by vv and contract the last axis with np.dot,
    # parallelising the outermost loop across threads.
    @nb.njit('float64(float64[:,:,:,:], float64[:,:,:,:], float64[:])', parallel=True)
    def dotABv(a, b, vv):
        res = 0.0
        for i in nb.prange(a.shape[0]):
            for j in range(a.shape[1]):
                for k in range(a.shape[2]):
                    # Scalar reduction; Numba recognises it across prange threads.
                    res += vv[k] * np.dot(a[i, j, k, :], b[i, j, k, :])
        return res
    
    v = np.sqrt(np.arange(N))
    A = np.random.normal(size=(N, N, N, N))
    B = np.random.normal(size=(N, N, N, N))
    C = np.random.normal(size=(N, N, N, N))
    
    %timeit dotABv(A, B, v)
    %timeit np.vdot(A, B)  # plain dot without v, just for comparison
    %timeit np.vdot(A * v[None, None, :, None], B)
    # Output :
    # 501 µs ± 1.42 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
    # 1.1 ms ± 111 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
    # 14.7 ms ± 239 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    print(dotABv(A, B, v), np.vdot(A * v[None, None, :, None], B))
    # 5105.504508154087 5105.5045081541075  (agreement up to floating-point rounding)
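
    If a strict IEEE summation order is not required, a fastmath variant of the same kernel may vectorise the reduction more aggressively; this is a sketch under that assumption (dotABv_fm is just an illustrative name), and the gain is hardware-dependent:
    
    # fastmath=True lets LLVM reassociate the floating-point sums, which
    # typically enables wider SIMD reductions; the result can differ from
    # dotABv in the last few digits.
    @nb.njit('float64(float64[:,:,:,:], float64[:,:,:,:], float64[:])',
             parallel=True, fastmath=True)
    def dotABv_fm(a, b, vv):
        res = 0.0
        for i in nb.prange(a.shape[0]):
            for j in range(a.shape[1]):
                for k in range(a.shape[2]):
                    res += vv[k] * np.dot(a[i, j, k, :], b[i, j, k, :])
        return res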