Mahalanobis distance not equal to Euclidean distance after PCA


I am trying to compute the Mahalanobis distance as the Euclidean distance after transforming the data with PCA; however, I do not get the same results. The following code:

import numpy as np
from scipy.spatial.distance import mahalanobis
from sklearn.decomposition import PCA

X = [[1,2], [2,2], [3,3]]

mean = np.mean(X, axis=0)
cov = np.cov(X, rowvar=False)
covI = np.linalg.inv(cov)

maha = mahalanobis(X[0], mean, covI)
print(maha)

pca = PCA()

X_transformed = pca.fit_transform(X)

stdev = np.std(X_transformed, axis=0)
X_transformed /= stdev

print(np.linalg.norm(X_transformed[0]))

prints

1.1547005383792515
1.4142135623730945

To my understanding, PCA decorrelates the dimensions, and dividing by the standard deviation weights every dimension equally, so the Euclidean distance should equal the Mahalanobis distance. Where am I going wrong?


Solution

  • According to this discussion, the relationship between PCA and the Mahalanobis distance only holds when the PCA components are scaled to unit variance. This can be achieved by whitening the data during the PCA step (more information here). Intuitively, whitening divides each principal axis by its standard deviation, which is exactly the rescaling that the inverse covariance matrix performs inside the Mahalanobis distance.

    Once you do that, the Mahalanobis distance in the original space is equal to the Euclidean distance in the PCA space. You can see a demonstration of that in the code below:

    import numpy as np
    from scipy.spatial.distance import mahalanobis, euclidean
    from sklearn.decomposition import PCA
    
    X = np.array([[1, 2], [2, 2], [3, 3]])
    
    # Mahalanobis distance between the first two points, in the original space
    cov = np.cov(X, rowvar=False)
    covI = np.linalg.inv(cov)
    maha = mahalanobis(X[0], X[1], covI)
    
    # whiten=True rescales every principal component to unit variance
    pca = PCA(whiten=True)
    X_transformed = pca.fit_transform(X)
    
    print('Mahalanobis distance: ' + str(maha))
    print('Euclidean distance: ' + str(euclidean(X_transformed[0], X_transformed[1])))
    

    The output is:

    Mahalanobis distance: 2.0
    Euclidean distance: 2.0000000000000004
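
    As a side note, the mismatch in the question's snippet comes from inconsistent normalizations rather than from PCA itself: np.cov divides by n - 1 by default, whereas np.std defaults to ddof=0 and divides by n (sklearn's whitening likewise uses n - 1). Here is a minimal sketch of the original approach with np.std(..., ddof=1), which then reproduces the Mahalanobis distance to the mean:

    import numpy as np
    from scipy.spatial.distance import mahalanobis
    from sklearn.decomposition import PCA
    
    X = np.array([[1, 2], [2, 2], [3, 3]])
    
    mean = np.mean(X, axis=0)
    covI = np.linalg.inv(np.cov(X, rowvar=False))  # np.cov normalizes by n - 1
    maha = mahalanobis(X[0], mean, covI)
    
    X_transformed = PCA().fit_transform(X)
    # Use the sample standard deviation (ddof=1) to match np.cov,
    # rather than np.std's default population estimate (ddof=0)
    X_transformed /= np.std(X_transformed, axis=0, ddof=1)
    
    print(maha)                              # 1.1547005383792515
    print(np.linalg.norm(X_transformed[0]))  # same value, up to rounding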