Tags: numpy, vectorization, mahalanobis

Speeding up Mahalanobis distance calculation


Background:

I am implementing a Sequential Backward Selection algorithm to select features from a dataset. The dataset in question is MNIST. I have 60000 vectors of length 784.

The algorithm requires me to leave out one feature, fi, from the total of 784 and select the remaining 783 features, called selection in the code below. I then must compute the Mahalanobis distance from each vector to its respective class mean. Once this iteration is completed, I leave out two features, then three, and so on. Each of these iterations takes 3 minutes.

I have to select 500 features, so the above is repeated 500 times; in total the Mahalanobis distance is computed 500 x 784 = 392,000 times. Each computation requires the inverse of a covariance matrix, but that inverse does not exist because the matrix is singular, so I am using numpy's pseudo-inverse (pinv) instead.

Problem

As you can imagine, the above is extremely slow, and computing the pseudo-inverse is the slowest step. I thought I could get away with precomputing the pseudo-inverse once and then deleting the rows and columns associated with fi. However, it turns out that this sliced pseudo-inverse is not equal to the pseudo-inverse computed directly from the vectors with fi already deleted.
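A small reproduction of this with random data (the shapes here are arbitrary; having fewer samples than features makes the covariance singular, as in my case):

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 8))                   # fewer samples than features
cov = np.cov(A, rowvar=False)                     # 8x8 and singular (rank <= 4)

keep = np.delete(np.arange(8), 3)                 # leave out feature 3

sliced = np.linalg.pinv(cov)[np.ix_(keep, keep)]  # slice the full pseudo-inverse
direct = np.linalg.pinv(cov[np.ix_(keep, keep)])  # pseudo-inverse of the slice

print(np.allclose(sliced, direct))                # False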

What I've tried

I have tried vectorizing this to a large extent and processing stacks of arrays, only to find that the batched approach was slower. I have tried np.einsum, cdist and even numexpr. Nothing really helps.
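For reference, the einsum formulation of the per-row quadratic form (equivalent to the sum-of-products in the code below; the arrays here are stand-ins) looks like this:

import numpy as np

rng = np.random.default_rng(0)
delta = rng.standard_normal((100, 783))   # stand-in deviations from a class mean
pseudoInverse = np.eye(783)               # stand-in for pinv of the covariance

# squared Mahalanobis distance of every row at once; same result as
# np.sum((delta @ pseudoInverse) * delta, axis=-1)
dists = np.einsum('ij,jk,ik->i', delta, pseudoInverse, delta)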

This leads me to believe that my best chance of speeding this up is somehow moving the covariance and pseudo-inverse calculations out of the loop. This is my current code:

import numpy as np
from numpy.linalg import pinv

def mahalanobis(self, data, lbls, selection):
    subset = data[:, tuple(selection)]
    d = np.empty(10)

    for n in range(10):
        class_rows = subset[lbls == n]             # all vectors of class n
        mean = np.mean(class_rows, axis=0)         # class mean
        pseudoInverse = pinv(np.cov(class_rows, rowvar=False))
        delta = class_rows - mean                  # deviations from the class mean
        d[n] = np.mean(np.sum((delta @ pseudoInverse) * delta, axis=-1))
    return np.mean(d)
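A minimal harness to exercise this (random arrays as stand-ins for MNIST; Selector is just a hypothetical placeholder for the class the method lives on):

class Selector:
    mahalanobis = mahalanobis              # the method defined above

rng = np.random.default_rng(0)
data = rng.standard_normal((60000, 784))   # stand-in for the MNIST vectors
lbls = rng.integers(0, 10, size=60000)     # stand-in for the digit labels
selection = range(783)                     # leave out the last feature

print(Selector().mahalanobis(data, lbls, selection))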

Question

How can I speed this computation up? From the tests I've done over the past week, it seems the slowest part of this computation is the line pseudoInverse = pinv(np.cov(class_rows, rowvar=False)).


Solution

  • Right now, your code is essentially:

    def mahalanobis(delta, cov):
        ci = np.linalg.pinv(cov)
        return np.sum(((delta @ ci) * delta), axis=-1)
    

    You can speed this up a little by:

    • Using svd directly instead of pinv, and eliminating the conjugations that you're not using (a sketch of this variant follows the eigh one below).
    • Using eigh instead of svd, which exploits the symmetry of the covariance matrix:

    def mahalanobis_eigh(delta, cov):
        s, u = np.linalg.eigh(cov)
        # note: missing filtering of small s, which you might want to consider adding - pinv already does this for you
        ci = u @ (1/s[...,None] * u.T)
        return np.sum(((delta @ ci) * delta), axis=-1)
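
    For the first bullet, the svd version might look like this (a sketch; like the eigh version it omits the small-singular-value cutoff that pinv applies):

    def mahalanobis_svd(delta, cov):
        u, s, vt = np.linalg.svd(cov)
        # for a real symmetric cov the plain transposes below suffice -
        # no conjugations needed; same missing filtering of small s as above
        ci = vt.T @ (1/s[..., None] * u.T)
        return np.sum(((delta @ ci) * delta), axis=-1)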
    

    It's worth noting that neither of these functions nor your original code works correctly for complex values, since the conjugations have been dropped.
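
    If complex support were ever needed, a Hermitian-safe variant (my sketch, restoring the dropped conjugations) would be:

    def mahalanobis_eigh_hermitian(delta, cov):
        s, u = np.linalg.eigh(cov)               # eigh also accepts Hermitian input
        ci = u @ (1/s[..., None] * u.conj().T)   # conjugate transpose, not plain .T
        # conjugating one delta factor keeps the quadratic form real-valued
        return np.sum((delta @ ci) * delta.conj(), axis=-1).real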