Tags: python, r, mahalanobis

Is there a Python equivalent to the mahalanobis() function in R? If not, how can I implement it?


I have the following R code that calculates the Mahalanobis distance on the Iris dataset and returns a numeric vector of 150 values, one for each observation in the dataset.

x=read.csv("Iris Data.csv")
mean<-colMeans(x)
Sx<-cov(x)
D2<-mahalanobis(x,mean,Sx)  

I tried to implement the same in Python using the 'scipy.spatial.distance.mahalanobis(u, v, VI)' function, but it seems to accept only one-dimensional arrays as parameters.


Solution

  • I used the Iris dataset from R; I assume it is the same one you are using.

    First, this is my R benchmark, for comparison:

    x <- read.csv("IrisData.csv")
    x <- x[,c(2,3,4,5)]
    mean<-colMeans(x)
    Sx<-cov(x)
    D2<-mahalanobis(x,mean,Sx)  
    

    Then, in Python, you can use:

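    import pandas as pd
    from scipy.linalg import inv
    from scipy.spatial.distance import mahalanobis
    
    x = pd.read_csv('IrisData.csv')
    x = x.iloc[:, 1:]  # drop the first column, as x[,c(2,3,4,5)] does in R
    
    # scipy's mahalanobis() expects the inverse of the covariance matrix
    Sx = x.cov().values
    Sx = inv(Sx)
    
    mean = x.mean().values
    
    def mahalanobisR(X, meanCol, IC):
        # Square each distance so the result matches R's mahalanobis()
        m = []
        for i in range(X.shape[0]):
            m.append(mahalanobis(X.iloc[i, :], meanCol, IC) ** 2)
        return m
    
    mR = mahalanobisR(x, mean, Sx)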
    

    I defined a function so you can reuse it on other datasets (note that it expects X to be a pandas DataFrame).

    Comparing results:

    In R

    > D2[c(1,2,3,4,5)]
    
    [1] 2.134468 2.849119 2.081339 2.452382 2.462155
    

    In Python:

    In [43]: mR[0:5]
    Out[45]: 
    [2.1344679233248431,
     2.8491186861585733,
     2.0813386639577991,
     2.4523816316796712,
     2.4621545347140477]
    

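    Just be aware that R's mahalanobis() returns the squared Mahalanobis distance, whereas scipy's mahalanobis() returns the distance itself, which is why the function above squares each value.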
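
If the row-by-row loop becomes slow on larger tables, the same squared distances can be computed for all rows at once with NumPy. The following is only a sketch: `mahalanobis_sq` is a hypothetical helper name, not part of scipy or R, and the random data is a stand-in for the Iris measurements. When `center` and `cov` are omitted it falls back to the column means and the sample covariance, which mirrors the `colMeans(x)` and `cov(x)` calls in the R benchmark.

```python
import numpy as np


def mahalanobis_sq(X, center=None, cov=None):
    """Squared Mahalanobis distance of every row of X (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    if center is None:
        center = X.mean(axis=0)        # column means, like colMeans(x)
    if cov is None:
        cov = np.cov(X, rowvar=False)  # sample covariance (n - 1 denominator)
    diff = X - center
    # Solve cov @ z = diff for every row instead of forming the inverse explicitly.
    solved = np.linalg.solve(cov, diff.T).T
    # Row-wise dot product of diff and solved: one squared distance per row.
    return np.einsum("ij,ij->i", diff, solved)


# Synthetic stand-in data (an assumption; replace with your own numeric table).
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
d2 = mahalanobis_sq(X)
print(d2[:5])
```

A pandas DataFrame containing only numeric columns should also work as input, since `np.asarray` converts it to an array. Take the square root of the result if you need the unsquared distance that scipy's `mahalanobis()` reports.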