Search code examples
rexcelmahalanobis

Simple example calculating Mahalanobis distance between two groups in R


I'm trying to reproduce this example using Excel to calculate the Mahalanobis distance between two groups.

Plot of data from example

To my mind the example provides a good explanation of the concept. However, I'm not able to reproduce in R.

The result obtained in the example using Excel is Mahalanobis(g1, g2) = 1.4104.

Following the answer given here for R and apply it to the data above as follows:

# dataset used in the Excel example
g1 <- matrix(c(2, 2, 2, 5, 6, 5, 7, 3, 4, 7, 6, 4, 5, 3, 4, 6, 2, 5, 1, 3), ncol = 2, byrow = TRUE)
g2 <- matrix(c(6, 5, 7, 4, 8, 7, 5, 6, 5, 4), ncol = 2, byrow = TRUE)

# function adopted from R example
D.sq <- function (g1, g2) {
    dbar <- as.vector(colMeans(g1) - colMeans(g2))
    S1 <- cov(g1)
    S2 <- cov(g2)
    n1 <- nrow(g1)
    n2 <- nrow(g2)
    V <- as.matrix((1/(n1 + n2 - 2)) * (((n1 - 1) * S1) + ((n2 - 1) * S2)))
    D.sq <- t(dbar) %*% solve(V) %*% dbar
    res <- list()
    res$D.sq <- D.sq
    res$V <- V
    res
}

D.sq(g1,g2)

and executing the function on the data returns the following output:

$D.sq
         [,1]
[1,] 1.724041

$V
          [,1]      [,2]
[1,] 3.5153846 0.3153846
[2,] 0.3153846 2.2230769

Afaik $D.sq represents the distance and 1.724 is significantly different to the 1.4101 result from the Excel example. As I'm new to the concept of the Mahalanobis distance I was wondering if I did something wrong and/or there's a better way to calculate this e.g. using mahalanobis()?


Solution

  • The reasons why do you get different result are

    • The Excel algorithm is actually different to the R algorithm in how you calculate the pooled covariance matrix, the R version gives you the result of unbiased estimate of covariance matrix, while the Excel version gives you the MLE estimate. In the R version, you calculate the matrix like: ((n1 - 1) * cov(g1) + (n2 - 1) * cov(g2)) / (n1 + n2 - 2); while in Excel version: ((n1 - 1) * cov(g1) + (n2 - 1) * cov(g2)) / (n1 + n2).

    • The last calculation step in the Excel post you refer to is incorrect, the result should be 1.989278 instead.

    Edit:

    The unbiased estimator for pooled covariance matrix is the standard way, as is in the Wikipedia page: https://en.wikipedia.org/wiki/Pooled_variance . A related fact is that in R, when you use cov or var, you get an unbiased estimator instead of MLE estimator for covariance matrix.

    Edit2: The mahalanobis function in R calculates the mahalanobis distance from points to a distribution. It does not calculate the mahalanobis distance of two samples.

    Conclusion: In sum, the most standard way to calculate mahalanobis distance between two samples is the R code in the original post, which uses the unbiased estimator of pooled covariance matrix.