I'm calculating a covariance matrix from a 2D array using np.cov, and using it to get nearest neighbors with Mahalanobis distance.
import numpy as np
from sklearn.neighbors import NearestNeighbors
c = np.cov(arr)
neigh = NearestNeighbors(n_neighbors=100, metric='mahalanobis', metric_params={'VI': np.linalg.inv(c)})
neigh.fit(dfeatures)
But for some reason, I'm getting
/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py:131: RuntimeWarning: invalid value encountered in sqrt
and the distances returned for any query point are nan.
If I pass an identity matrix in place of np.linalg.inv(c), NearestNeighbors works as expected. I suspected that c might not actually be positive semidefinite, so the argument of the sqrt in the Mahalanobis distance could become negative.
I checked the eigenvalues of the resulting c, and many of them turned out to be negative (and complex) but close to 0.
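A minimal sketch of such an eigenvalue check, assuming synthetic data in place of arr (the array X and its shape are made up for illustration). Since a covariance matrix is symmetric, np.linalg.eigvalsh can be used; it always returns real eigenvalues, so complex parts reported by the general solver np.linalg.eigvals are most likely round-off artifacts.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 50 samples, 200 features. With fewer samples than
# features the covariance matrix is rank-deficient (many zero eigenvalues).
X = rng.normal(size=(50, 200))
c = np.cov(X, rowvar=False)          # 200 x 200, columns are the variables

eig_general = np.linalg.eigvals(c)   # general solver; may return complex values
eig_sym = np.linalg.eigvalsh(c)      # symmetric solver; real, ascending order

min_eig = eig_sym[0]
print("smallest eigenvalue (eigvalsh):", min_eig)
print("general solver returned complex dtype:", np.iscomplexobj(eig_general))
```

Eigenvalues that should be exactly zero show up as tiny positive or negative numbers here, which would be consistent with the nan values coming out of the sqrt.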
I have a few questions:
It turns out this is in fact caused by numerical error. A workaround is to add a small positive number to the diagonal elements of the covariance matrix. The larger this number, the closer the distance gets to a (rescaled) Euclidean distance, so it must be chosen carefully.
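A rough sketch of the diagonal workaround, under stated assumptions: the helper regularize_covariance, the rel_eps default of 1e-6, and the synthetic array X are all hypothetical and not from the original code. Scaling the added constant by the average variance is just one possible choice.

```python
import numpy as np

def regularize_covariance(c, rel_eps=1e-6):
    """Hypothetical helper: return c + eps * I, with eps scaled to the mean variance."""
    c = (c + c.T) / 2.0                       # enforce exact symmetry
    eps = rel_eps * np.trace(c) / c.shape[0]  # assumed scaling, tune for your data
    return c + eps * np.eye(c.shape[0])

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 200))                # synthetic stand-in for the real features
c = np.cov(X, rowvar=False)
c_reg = regularize_covariance(c)

min_eig_before = np.linalg.eigvalsh(c)[0]     # approximately zero, sign not guaranteed
min_eig_after = np.linalg.eigvalsh(c_reg)[0]  # shifted up by eps, so positive

# Mahalanobis distance between two samples using the regularized inverse
VI = np.linalg.inv(c_reg)
diff = X[0] - X[1]
d = np.sqrt(diff @ VI @ diff)
print(min_eig_before, min_eig_after, d)
```

The resulting VI can then be passed as metric_params={'VI': VI} exactly as in the question. Increasing rel_eps makes the inverse better conditioned but pushes the metric toward a rescaled Euclidean distance.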