Tagged: r, normal-distribution, python, matrix, numpy

Vectorizing code to calculate (squared) Mahalanobis Distance


EDIT 2: this post seems to have been moved from CrossValidated to StackOverflow since it is mostly about programming, but that means my fancy MathJax doesn't work anymore. Hopefully this is still readable.

Say I want to calculate the squared Mahalanobis distance between two vectors x and y with covariance matrix S. This is a fairly simple function defined by

M2(x, y; S) = (x - y)^T * S^-1 * (x - y)

With python's numpy package I can do this as

# x, y = numpy.ndarray of shape (n,)
# s_inv = numpy.ndarray of shape (n, n)
diff = x - y
d2 = diff.T.dot(s_inv).dot(diff)

or in R as

diff <- x - y
d2 <- t(diff) %*% s_inv %*% diff

In my case, though, I am given

  • m by n matrix X
  • n-dimensional vector mu
  • n by n covariance matrix S

and want to find the m-dimensional vector d such that

d_i = M2(x_i, mu; S)  ( i = 1 .. m )

where x_i is the ith row of X.

This is not difficult to accomplish using a simple loop in python:

d = numpy.zeros((m,))
for i in range(m):
    diff = x[i,:] - mu
    d[i] = diff.T.dot(s_inv).dot(diff)

Of course, since the outer loop happens in python rather than in native code in the numpy library, it's not as fast as it could be. n and m are about 3-4 and several hundred thousand, respectively, and I'm doing this somewhat often in an interactive program, so a speedup would be very useful.

Mathematically, the only way I've been able to formulate this using basic matrix operations is

d = diag( X' * S^-1 * X'^T )

where

 x'_i = x_i - mu

which is simple to write a vectorized version of, but the simplicity is unfortunately outweighed by the inefficiency of computing a 10-billion-plus element matrix only to take its diagonal... I believe this operation should be easily expressible in Einstein notation, and thus could hopefully be evaluated quickly with numpy's einsum function, but I haven't even begun to figure out how that black magic works.
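
For concreteness, a direct translation of that formula (using the same names as the loop above, with x as the m by n matrix) would be something like

xc = x - mu                               # subtract mu from every row of x
d = numpy.diag(xc.dot(s_inv).dot(xc.T))   # materializes an (m, m) matrix just for its diagonal

which makes the waste explicit: the entire (m, m) product is formed only to read off its diagonal.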

So, I would like to know: is there either a nicer way to formulate this operation mathematically (in terms of simple matrix operations), or could someone suggest some nice vectorized (python or R) code that does this efficiently?

BONUS QUESTION, for the brave

I don't actually want to do this once, I want to do it k ~ 100 times. Given:

  • m by n matrix X

  • k by n matrix U

  • Set of n by n covariance matrices each denoted S_j (j = 1..k)

Find the m by k matrix D such that

D_i,j = M2(x_i, u_j; S_j)

Where i = 1..m, j = 1..k, x_i is the ith row of X and u_j is the jth row of U.

I.e., vectorize the following code:

# s_inv is (k x n x n) array containing "stacked" inverses
# of covariance matrices
d = numpy.zeros( (m, k) )
for j in range(k):
    for i in range(m):
        diff = x[i, :] - u[j, :]
        d[i, j] = diff.T.dot(s_inv[j, :, :]).dot(diff)

Solution

  • First off, it seems like maybe you're getting S and then inverting it. You shouldn't do that; it's slow and numerically inaccurate. Instead, you should get the Cholesky factor L of S so that S = L L^T; then

    M^2(x, y; L L^T)
      = (x - y)^T (L L^T)^-1 (x - y)
      = (x - y)^T L^-T L^-1 (x - y)
      = || L^-1 (x - y) ||^2,
    

    and since L is triangular L^-1 (x - y) can be computed efficiently.
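
    For a single pair, a minimal sketch of this idea (assuming x, y, and S as defined in the question):

    import numpy as np
    import scipy.linalg

    L = np.linalg.cholesky(S)                                 # S = L L^T (L lower triangular)
    z = scipy.linalg.solve_triangular(L, x - y, lower=True)   # z = L^-1 (x - y)
    d2 = z.dot(z)                                             # squared Mahalanobis distance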

    As it turns out, scipy.linalg.solve_triangular will happily do a bunch of these at once if you reshape it properly:

    L = np.linalg.cholesky(S)                                  # S = L L^T
    y = scipy.linalg.solve_triangular(L, (X - mu[np.newaxis]).T, lower=True)
    d = np.einsum('ij,ij->j', y, y)                            # column-wise squared norms
    

    Breaking that down a bit, y[i, j] is the ith component of L^-1 (X_j - \mu). The einsum call then does

    d_j = \sum_i y_{ij} y_{ij}
        = \sum_i y_{ij}^2
        = || y_j ||^2,
    

    like we need.
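
    As a quick sanity check with a made-up array (the sizes here are arbitrary), that einsum really is just a column-wise sum of squares:

    rng = np.random.default_rng(0)
    y_test = rng.normal(size=(3, 5))
    assert np.allclose(np.einsum('ij,ij->j', y_test, y_test),
                       (y_test ** 2).sum(axis=0))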

    Unfortunately, solve_triangular won't vectorize across its first argument, so you should probably just loop there. If k is only about 100, that's not going to be a significant issue.
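
    That loop might look something like the following sketch (the stacked-covariance array name S_stack is mine, not from the question):

    import numpy as np
    import scipy.linalg

    def mahalanobis_sq_stacked(X, U, S_stack):
        # D[i, j] = squared Mahalanobis distance between X[i] and U[j] under S_stack[j]
        m, k = X.shape[0], U.shape[0]
        D = np.empty((m, k))
        for j in range(k):
            L = np.linalg.cholesky(S_stack[j])                              # S_j = L L^T
            y = scipy.linalg.solve_triangular(L, (X - U[j]).T, lower=True)  # L^-1 (x_i - u_j) for all i
            D[:, j] = np.einsum('ij,ij->j', y, y)                           # squared column norms
        return D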


    If you are actually given S^-1 rather than S, then you can indeed do this with einsum more directly. Since S is quite small in your case, it's also possible that actually inverting the matrix and then doing this would be faster. As soon as n is a nontrivial size, though, you're throwing away a lot of numerical accuracy by doing this.
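
    For the non-bonus case, that direct version is just (a sketch, with T standing for the given S^-1):

    diff = X - mu                                  # shape (m, n)
    d = np.einsum('ik,kl,il->i', diff, T, diff)    # d[i] = diff[i] @ T @ diff[i]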

    To figure out what to do with einsum, write everything in terms of components. I'll go straight to the bonus case, writing S_j^-1 = T_j for notational convenience:

    D_{ij} = M^2(x_i, u_j; S_j)
      = (x_i - u_j)^T T_j (x_i - u_j)
      = \sum_k (x_i - u_j)_k ( T_j (x_i - u_j) )_k
      = \sum_k (x_i - u_j)_k \sum_l (T_j)_{k l} (x_i - u_j)_l
      = \sum_{k l} (X_{i k} - U_{j k}) (T_j)_{k l} (X_{i l} - U_{j l})
    

    So, if we make arrays X of shape (m, n), U of shape (k, n), and T of shape (k, n, n), then we can write this as

    diff = X[np.newaxis, :, :] - U[:, np.newaxis, :]   # shape (k, m, n)
    D = np.einsum('jik,jkl,jil->ij', diff, T, diff)    # shape (m, k)
    

    where diff[j, i, k] = X[i, k] - U[j, k].
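
    As a final sanity check against the original double loop, with small made-up sizes (everything below is hypothetical test data):

    import numpy as np

    rng = np.random.default_rng(1)
    m, k, n = 50, 4, 3
    X = rng.normal(size=(m, n))
    U = rng.normal(size=(k, n))
    A = rng.normal(size=(k, n, n))
    T = np.linalg.inv(A @ A.transpose(0, 2, 1) + n * np.eye(n))   # stacked S_j^-1, one per j

    diff = X[np.newaxis, :, :] - U[:, np.newaxis, :]
    D = np.einsum('jik,jkl,jil->ij', diff, T, diff)

    D_loop = np.empty((m, k))
    for j in range(k):
        for i in range(m):
            v = X[i] - U[j]
            D_loop[i, j] = v.dot(T[j]).dot(v)

    assert np.allclose(D, D_loop)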