Tags: python, scikit-learn, nearest-neighbor, mahalanobis

Nearest neighbors with uncertain points


I have two 2D point sets A and B. I want to find the first nearest neighbor in A for each point in B. However, I am dealing with uncertain points (i.e. each point has a mean (a 2D vector) and a 2×2 covariance matrix).

I would thus like to use the Mahalanobis distance, but in scikit-learn (for example) I cannot pass a covariance matrix for each point, as it expects a single covariance matrix for the whole dataset.
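For reference, the standard scikit-learn usage looks like the sketch below; the data and the shared covariance matrix are placeholders. The point is that metric_params takes a single inverse covariance matrix ('VI') applied to all points:

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Placeholder data and a single shared 2x2 covariance matrix
A = np.random.rand(10, 2)
B = np.random.rand(5, 2)
shared_cov = np.array([[0.5, 0.1],
                       [0.1, 0.3]])

# The built-in 'mahalanobis' metric accepts ONE inverse covariance
# matrix for the whole dataset, not one per point.
nn = NearestNeighbors(n_neighbors=1, metric='mahalanobis',
                      metric_params={'VI': np.linalg.inv(shared_cov)})
nn.fit(A)
distances, indices = nn.kneighbors(B)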

Currently, considering only the average locations (i.e. the means of my 2D normal distributions), I have:

from sklearn.neighbors import NearestNeighbors

nearest_neighbors = NearestNeighbors(n_neighbors=1, metric='l2').fit(A)
distance, indices = nearest_neighbors.kneighbors(B)

With my uncertain points, instead of using the L2 norm as a distance, I would rather compute, between a point a in A and a point b in B, their Mahalanobis distance:

d(a, b) = sqrt( transpose(mu_a-mu_b) * C * (mu_a-mu_b))

where C = inv(cov_a + cov_b)

where mu_a (resp. mu_b) and cov_a (resp. cov_b) are the 2D mean and 2×2 covariance matrix of the uncertain point a (resp. b).
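A minimal sketch of this distance written out directly; the helper name uncertain_mahalanobis and the example values are mine, not part of any library:

import numpy as np

def uncertain_mahalanobis(mu_a, cov_a, mu_b, cov_b):
    # C = inv(cov_a + cov_b); d = sqrt((mu_a - mu_b)^T * C * (mu_a - mu_b))
    diff = mu_a - mu_b
    C = np.linalg.inv(cov_a + cov_b)
    return np.sqrt(diff @ C @ diff)

# Example with placeholder means and covariances
mu_a, cov_a = np.array([0.0, 0.0]), np.array([[0.2, 0.0], [0.0, 0.1]])
mu_b, cov_b = np.array([1.0, 1.0]), np.array([[0.1, 0.0], [0.0, 0.3]])
print(uncertain_mahalanobis(mu_a, cov_a, mu_b, cov_b))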


Solution

  • I ended up using a custom distance:

    import numpy as np
    import scipy as sp
    import scipy.spatial  # needed so that sp.spatial.distance is available

    def my_mahalanobis_distance(x, y):
        '''
        x: array of shape (4,)  x[0]: mu_x_1,   x[1]: mu_x_2,
                                x[2]: cov_x_11, x[3]: cov_x_22
        y: array of shape (4,)  y[0]: mu_y_1,   y[1]: mu_y_2,
                                y[2]: cov_y_11, y[3]: cov_y_22
        '''
        # Mahalanobis distance between the two means, using the inverse of
        # the sum of the (diagonal) covariance matrices as the metric.
        return sp.spatial.distance.mahalanobis(x[:2], y[:2],
                                               np.linalg.inv(np.diag(x[2:])
                                                             + np.diag(y[2:])))


    Thus a point has 4 features (see the usage sketch after this list):

    • x and y coordinates
    • x and y variances (covariance matrix is diagonal in my case)
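
    For completeness, a minimal usage sketch with this custom metric; the random 4-feature arrays and the names A_features / B_features are placeholders of mine. Note that a Python callable metric is evaluated pair by pair, so this will be slow on large point sets:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    # Placeholder data: each row is [mu_1, mu_2, cov_11, cov_22]
    rng = np.random.default_rng(0)
    A_features = np.hstack([rng.random((10, 2)), 0.1 + rng.random((10, 2))])
    B_features = np.hstack([rng.random((5, 2)), 0.1 + rng.random((5, 2))])

    nearest_neighbors = NearestNeighbors(n_neighbors=1,
                                         metric=my_mahalanobis_distance)
    nearest_neighbors.fit(A_features)
    distances, indices = nearest_neighbors.kneighbors(B_features)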