python, numpy, matrix, mahalanobis

Vectorizing (squared) Mahalanobis distance in NumPy


I have X (n x d), Y (m x d), and a positive-definite L (d x d). I want to compute D (n x m), where D_ij = (X_i - Y_j) * L * (X_i - Y_j).T. Here n and m are around 250, and d is around 10^4.

I can use scipy.spatial.distance.cdist (which returns the unsquared distance, so the result would still need squaring), but this is very slow:

    scipy.spatial.distance.cdist(X, Y, metric='mahalanobis', VI=L)

Looking at Dougal's answer to this question, I tried

    diff = X[np.newaxis, :, :] - Y[:, np.newaxis, :]
    D = np.einsum('jik,kl,jil->ij', diff, L, diff)

This is also very slow.

Is there a more efficient way to vectorize this computation?


Solution

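  • Combining np.tensordot and np.einsum helps in situations like these. np.tensordot performs the contraction of diff with L as a matrix multiplication, and np.einsum then only has to do the remaining elementwise multiply-and-sum -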

    np.einsum('jil,jil->ij',np.tensordot(diff, L, axes=(2,0)), diff)
    

    Runtime test -

    In [26]: n,m,d = 30,40,50
        ...: X = np.random.rand(n,d)
        ...: L = np.random.rand(d,d)
        ...: Y = np.random.rand(m,d)
        ...: 
    
    In [27]: diff = X[np.newaxis, :, :] - Y[:, np.newaxis, :]
    
    In [28]: %timeit np.einsum('jik,kl,jil->ij', diff, L, diff)
    100 loops, best of 3: 7.81 ms per loop
    
    In [29]: %timeit np.einsum('jil,jil->ij',np.tensordot(diff, L, axes=(2,0)), diff)
    1000 loops, best of 3: 472 µs per loop
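
A note on the sizes in the question: with n = m = 250 and d = 10^4, diff holds 6.25e8 float64 values (about 5 GB), and the tensordot result is a second array of the same shape. The sketch below is a different technique from the answer above, not a restatement of it. It expands the quadratic form into four terms so that no (m, n, d) array is ever built. The helper name sq_mahalanobis_expanded is made up for illustration, and the sizes in the check are the small ones from the runtime test, not the ones from the question.

```python
import numpy as np

def sq_mahalanobis_expanded(X, Y, L):
    """Return D with D[i, j] = (X[i] - Y[j]) @ L @ (X[i] - Y[j]).

    Hypothetical helper (not from the answer above). It expands the
    quadratic form into four terms so the (m, n, d) difference tensor
    is never materialized. L is not assumed to be symmetric.
    """
    XL = X @ L                          # (n, d)
    YL = Y @ L                          # (m, d)
    xx = np.einsum('ik,ik->i', XL, X)   # x_i L x_i^T, shape (n,)
    yy = np.einsum('jk,jk->j', YL, Y)   # y_j L y_j^T, shape (m,)
    xy = XL @ Y.T                       # x_i L y_j^T, shape (n, m)
    yx = (YL @ X.T).T                   # y_j L x_i^T, shape (n, m)
    return xx[:, None] + yy[None, :] - xy - yx


# Small check against the tensordot + einsum version from the answer.
rng = np.random.default_rng(0)
n, m, d = 30, 40, 50
X = rng.random((n, d))
Y = rng.random((m, d))
L = rng.random((d, d))

diff = X[np.newaxis, :, :] - Y[:, np.newaxis, :]
D_ref = np.einsum('jil,jil->ij', np.tensordot(diff, L, axes=(2, 0)), diff)
D_new = sq_mahalanobis_expanded(X, Y, L)
print(np.allclose(D_ref, D_new))
```

If L is symmetric, as a positive-definite matrix normally is, xy and yx are equal and one of the two products can be dropped. Because the expansion subtracts large terms from each other, it can lose some precision when X[i] and Y[j] are nearly identical, so it is worth comparing against the direct version on real data before relying on it.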