Tags: python-3.x, numpy, scikit-learn, covariance

numpy.cov returning a matrix which is not positive semi-definite


I'm calculating a covariance matrix from a 2D array using np.cov, and using it to get nearest neighbors with Mahalanobis distance.

import numpy as np
from sklearn.neighbors import NearestNeighbors

c = np.cov(arr)
neigh = NearestNeighbors(n_neighbors=100, metric='mahalanobis', metric_params={'VI': np.linalg.inv(c)})
neigh.fit(dfeatures)

But for some reason, I'm getting

/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py:131: RuntimeWarning: invalid value encountered in sqrt

and the distance returned for every query point is nan.

If I pass an identity matrix instead of c, NearestNeighbors works as expected. I suspected that c might not actually be positive semidefinite, so that the argument of the square root in the Mahalanobis distance could become negative.
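For reference, the standard definition of the Mahalanobis distance shows where a negative value could enter the square root. The symbols C (the covariance matrix), \lambda, and v are notation introduced here, not taken from the question; C^{-1} plays the role of the VI parameter.

```latex
d(x, y) = \sqrt{(x - y)^{\top} C^{-1} (x - y)}

% If C^{-1} has an eigenvalue \lambda < 0 with eigenvector v, then for x - y = v:
(x - y)^{\top} C^{-1} (x - y) = \lambda \, \lVert v \rVert^{2} < 0
```

In that case the square root has a negative argument, which floating-point arithmetic reports as nan.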

I checked the eigenvalues of the resulting c, and many of them turned out to be negative (and complex) but close to 0.
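A minimal sketch of such an eigenvalue check, using a small random array in place of arr (the data, seed, and tolerance are illustrative assumptions, not values from the question). It uses np.linalg.eigvalsh, which treats the matrix as symmetric and returns real eigenvalues, whereas a general-purpose routine such as np.linalg.eigvals may return slightly complex values for a nearly singular matrix. Note also that np.cov treats each row as a variable by default, so a matrix estimated from fewer observations than variables is singular by construction.

```python
import numpy as np

# Hypothetical data: 5 variables (rows) but only 3 observations (columns),
# so the 5x5 covariance matrix has rank at most 2 and is singular.
rng = np.random.default_rng(0)
arr = rng.normal(size=(5, 3))

c = np.cov(arr)  # rows are treated as variables by default (rowvar=True)

# eigvalsh assumes a symmetric matrix and returns real eigenvalues
# in ascending order.
eigenvalues = np.linalg.eigvalsh(c)
print(eigenvalues)

# Eigenvalues that are zero in exact arithmetic typically show up as tiny
# positive or negative numbers because of floating-point rounding.
tol = 1e-10 * max(eigenvalues.max(), 1.0)  # assumed tolerance
is_psd_within_tol = bool(eigenvalues.min() >= -tol)
print(is_psd_within_tol)
```

If the smallest eigenvalues are negative only at the level of rounding error, the matrix is positive semidefinite in exact arithmetic but singular, and inverting it amplifies that rounding error.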

I have a few questions:

  • Is this entirely due to numerical error (or am I doing something wrong)?
  • If it is due to numerical error, is there a way to fix it?

Solution

  • It turns out this is in fact due to numerical error. A workaround is to add a small positive number to the diagonal elements of the covariance matrix. The larger this number, the closer the resulting distance is to the Euclidean distance (up to a constant scale factor), so it must be chosen with care.
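The workaround above can be sketched as follows. The helper name regularized_inverse_covariance, the rel_eps value, and the synthetic data are assumptions made for illustration; the sketch also assumes the features are stored as rows of samples and columns of variables, hence rowvar=False.

```python
import numpy as np


def regularized_inverse_covariance(features, rel_eps=1e-6):
    """Hypothetical helper: invert cov(features) after adding a small diagonal term.

    features has shape (n_samples, n_features). rel_eps is an assumed value,
    expressed relative to the average variance.
    """
    # rowvar=False: columns are variables, rows are observations.
    c = np.cov(features, rowvar=False)
    eps = rel_eps * np.trace(c) / c.shape[0]
    c_reg = c + eps * np.eye(c.shape[0])
    return np.linalg.inv(c_reg)


rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))
# The fourth feature is an exact linear combination of the first two,
# so the unregularized covariance matrix is singular.
dfeatures = np.column_stack([base, base[:, 0] + base[:, 1]])

vi = regularized_inverse_covariance(dfeatures)

# Squared Mahalanobis distances from the first sample to every sample.
diff = dfeatures - dfeatures[0]
sq_dist = np.einsum("ij,jk,ik->i", diff, vi, diff)
dist = np.sqrt(sq_dist)
print(np.isnan(dist).any())
```

The resulting vi can then be passed as metric_params={'VI': vi}, as in the question. Scaling the added term by the average variance is one possible design choice; it keeps the correction small relative to the data rather than fixed in absolute terms.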