Search code examples
rmatrixnumerical-methodseuclidean-distancemahalanobis

Fast distance calculation in R


I'm trying to calculate the

1) Euclidean distance, and

2) Mahalanobis distance

for a set of matricies in r. I've been doing it as such:

v1 <- structure(c(0.508, 0.454, 0, 2.156, 0.468, 0.488, 0.682, 1, 1.832, 
            0.44, 0.928, 0.358, 1, 1.624, 0.484, 0.516, 0.378, 1, 1.512, 
            0.514, 0.492, 0.344, 0, 1.424, 0.508, 0.56, 0.36, 1, 1.384, 0.776, 
            1.888, 0.388, 0, 1.464, 0.952, 0.252, 0.498, 1, 1.484, 0.594, 
            0.256, 0.54, 2, 2.144, 0.402, 0.656, 2.202, 1, 1.696, 0.252), 
          .Dim = c(5L, 10L), 
          .Dimnames = list(NULL, c("KW_1", "KW_2", "KW_3", "KW_4", "KW_5", "KW_6", "KW_7", "KW_8", "KW_9", "KW_10")))

v2 <- structure(c(1.864, 1.864, 1.864, 1.864, 1.864, 1.6, 1.6, 1.6, 
            1.6, 1.6, 1.536, 1.536, 1.536, 1.536, 1.536, 1.384, 1.384, 1.384, 
            1.384, 1.384, 6.368, 6.368, 6.368, 6.368, 6.368, 2.792, 2.792, 
            2.792, 2.792, 2.792, 2.352, 2.352, 2.352, 2.352, 2.352, 2.624, 
            2.624, 2.624, 2.624, 2.624, 1.256, 1.256, 1.256, 1.256, 1.256, 
            1.224, 1.224, 1.224, 1.224, 1.224), 
          .Dim = c(5L, 10L), 
          .Dimnames = list(NULL, c("KW_1", "KW_2", "KW_3", "KW_4", "KW_5", "KW_6", "KW_7", "KW_8", "KW_9", "KW_10")))

L2 <- sqrt(rowSums((v1-v2)^2))  # Euclidean distance for each row

which provides:

[1] 7.132452 7.568359 7.536904 5.448696 7.163580

That's perfect! But I've heard you can also compute Euclidean/L2 distance using the following form:

enter image description here

I'd like to calculate my distance this way because the Mahalanobis distance is simply this and the covariance matrix. See this.

I haven't figured out how to code this in r, however. I've tried:

sqrt(crossprod((t(v1)-t(v2))))

and

sqrt((v1-v2) %*% t(v1-v2))

But they just don't give me what I want. Suggestions?

Note - I'm looking to do this as a single operation, not in a loop of any kind. It has to be very fast because I'm doing it over millions of rows multiple times. Maybe it's not possible. I'm open to changing the format of v1 and v2.


Solution

  • You need to apply the formula to each row individually, so something like:

    > sapply(1:nrow(v1), function(i) {
    +     q = v1[i, ] - v2[i, ]
    +     d = sqrt(t(q) %*% q)
    +     d
    + })
    [1] 7.132452 7.568359 7.536904 5.448696 7.163580
    

    If you need something faster you can always try the same thing in C++ (code adapted from here):

    #include <Rcpp.h>
    
    using namespace Rcpp;
    
    double dist2(NumericVector x, NumericVector y){
        double d = sqrt( sum( pow(x - y, 2) ) );
        return d;
    }
    
    // [[Rcpp::export]]
    NumericVector calc_l2 (NumericMatrix x, NumericMatrix y){
        int out_length = x.nrow();
        NumericVector out(out_length);
    
        for (int i = 0 ; i < out_length; i++){
            NumericVector v1 = x.row(i);
            NumericVector v2 = y.row(i);
            double d = dist2(v1, v2);
            out(i) = d;
        }
        return (out) ;
    }
    

    Running in R:

    library(Rcpp)
    
    sourceCpp("calc_L2.cpp")
    calc_l2(v1, v2)