
Scipy - Nan when calculating Mahalanobis distance


When I try to calculate the Mahalanobis distance with the following Python code, I get some NaN entries in the result. Do you have any insight into why this happens? My data.shape = (181, 1500)

import numpy as np
from scipy.spatial.distance import pdist, squareform

data_log = np.log2(data + 1)                 # a log transform that I usually apply to my data
data_centered = data_log - data_log.mean(0)  # zero-centering
D = squareform(pdist(data_centered, 'mahalanobis'))

I also tried:

data_standard = data_centered / data_centered.std(0, ddof=1)
D = squareform(pdist(data_standard, 'mahalanobis'))

That also produced NaNs. The input is not corrupted, and other distances, such as the correlation distance, can be computed just fine. For some reason, when I reduce the number of features I stop getting NaNs. For example, the following calls do not produce any NaN:

D = squareform(pdist(data_centered[:, :200], 'mahalanobis'))
D = squareform(pdist(data_centered[:, 180:480], 'mahalanobis'))

while these do produce NaNs:

D = squareform(pdist(data_centered[:, :300], 'mahalanobis'))
D = squareform(pdist(data_centered[:, 180:600], 'mahalanobis'))

Any clue? Is this expected behaviour when some condition on the input is not satisfied?


Solution

  • You have fewer observations than features, so the covariance matrix V computed by the scipy code is singular. The code doesn't check for this and blindly computes the "inverse" of the covariance matrix. Because that numerically computed inverse is essentially garbage, the product (x-y)*inv(V)*(x-y) (where x and y are observations) can turn out to be negative, and taking the square root of a negative value yields nan.
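    To see why V is necessarily singular here, a quick sketch (with made-up random data, not from the question): the sample covariance of n observations has rank at most n - 1, so as soon as the number of features reaches n it cannot be inverted.

```python
import numpy as np

# Hypothetical data: 4 observations of 10 features, i.e. the same
# observations < features regime as in the question.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 10))

V = np.cov(x.T)                    # 10x10 sample covariance
print(np.linalg.matrix_rank(V))    # at most 4 - 1 = 3, so V is singular
```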

    For example, this array also results in a nan:

    In [265]: x
    Out[265]: 
    array([[-1. ,  0.5,  1. ,  2. ,  2. ],
           [ 2. ,  1. ,  2.5, -1.5,  1. ],
           [ 1.5, -0.5,  1. ,  2. ,  2.5]])
    
    In [266]: squareform(pdist(x, 'mahalanobis'))
    Out[266]: 
    array([[ 0.        ,         nan,  1.90394328],
           [        nan,  0.        ,         nan],
           [ 1.90394328,         nan,  0.        ]])
    

    Here's the Mahalanobis calculation done "by hand":

    In [279]: V = np.cov(x.T)
    

    In theory, V is singular; the following value is effectively 0:

    In [280]: np.linalg.det(V)
    Out[280]: -2.968550671342364e-47
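    A near-zero determinant is hard to interpret on its own (it scales with the product of all eigenvalues), so here is a sketch of a more direct diagnostic, assuming the same 3x5 array x as above:

```python
import numpy as np

# The same 3x5 array as in the answer's example.
x = np.array([[-1. ,  0.5,  1. ,  2. ,  2. ],
              [ 2. ,  1. ,  2.5, -1.5,  1. ],
              [ 1.5, -0.5,  1. ,  2. ,  2.5]])
V = np.cov(x.T)

# Only 2 of the 5 eigenvalues are meaningfully nonzero: V has rank 2.
print(np.linalg.eigvalsh(V))
print(np.linalg.matrix_rank(V))   # 2
```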
    

    But inv doesn't see the problem, and returns an inverse:

    In [281]: VI = np.linalg.inv(V)
    

    Let's compute the distance between x[0] and x[2] and verify that we get the same non-nan value (1.9039) returned by pdist when we use VI:

    In [295]: delta = x[0] - x[2]
    
    In [296]: np.dot(np.dot(delta, VI), delta)
    Out[296]: 3.625
    
    In [297]: np.sqrt(np.dot(np.dot(delta, VI), delta))
    Out[297]: 1.9039432764659772
    

    Here's what happens when we try to compute the distance between x[0] and x[1]:

    In [300]: delta = x[0] - x[1]
    
    In [301]: np.dot(np.dot(delta, VI), delta)
    Out[301]: -1.75
    

    Then the square root of that value gives nan.


    In scipy 0.16 (to be released in June 2015), you will get an error instead of nan or garbage. The error message describes the problem:

    In [4]: x = array([[-1. ,  0.5,  1. ,  2. ,  2. ],
       ...:        [ 2. ,  1. ,  2.5, -1.5,  1. ],
       ...:        [ 1.5, -0.5,  1. ,  2. ,  2.5]])
    
    In [5]: pdist(x, 'mahalanobis')
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-5-a3453ff6fe48> in <module>()
    ----> 1 pdist(x, 'mahalanobis')
    
    /Users/warren/local_scipy/lib/python2.7/site-packages/scipy/spatial/distance.pyc in pdist(X, metric, p, w, V, VI)
       1298                                      "singular. For observations with %d "
       1299                                      "dimensions, at least %d observations "
    -> 1300                                      "are required." % (m, n, n + 1))
       1301                 V = np.atleast_2d(np.cov(X.T))
       1302                 VI = _convert_to_double(np.linalg.inv(V).T.copy())
    
    ValueError: The number of observations (3) is too small; the covariance matrix is singular. For observations with 5 dimensions, at least 6 observations are required.
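    As a possible workaround (a sketch, not part of the original answer): either reduce the dimensionality before computing distances, or build a Moore-Penrose pseudo-inverse of the covariance matrix yourself and pass it to pdist through its VI keyword, which bypasses the internal inversion:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical data in the problematic regime: 3 observations, 5 features,
# so the sample covariance is singular.
rng = np.random.default_rng(0)
x = rng.standard_normal((3, 5))

V = np.cov(x.T)
VI = np.linalg.pinv(V)   # Moore-Penrose pseudo-inverse instead of inv

# Passing VI explicitly skips scipy's own covariance inversion.
D = squareform(pdist(x, 'mahalanobis', VI=VI))
print(np.isnan(D).any())   # -> False: no NaNs with the pseudo-inverse
```

    With the pseudo-inverse, the quadratic form is evaluated only over the directions the data actually spans, so it stays non-negative; whether this generalized distance is meaningful for a given analysis is a separate question.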