Search code examples
rperformancesimilarityr-daisy

Speed of Daisy Function


I'm working on improving the speed of a function (for a dissimilarity measure) I'm writing which is quite similar mathematically to the Euclidean distance function. However, when I time my function compared to that implemented in the daisy function from the cluster package, I find quite a significant difference in speed, with daisy performing much better. Given that (I'm assuming) a dissimilarity measure would require O(n x p) time due to the need to compare each object to itself over all variables (where n is number of objects and p is number of variables), I find it difficult to understand how the daisy function performs so well (near constant time, from the few experiments I've done) relative to my simple and direct implementation. I present the code I have used both to implement and test below. I have tried looking through the r source code for the implementation of the daisy function, but I found it difficult to understand. I found no nested for loop. Any help with understanding why this function performs so fast and how I could possibly modify my code to have similar speed would be very highly appreciated.

euclidean <- function (df){
  
  no_obj <- nrow(df)
  
  dist <- array(0, dim = c(no_obj, no_obj))
  
  for (i in 1:no_obj){
    for (j in 1:no_obj){
      dist_v <- 0
      if(i != j){
        for (v in 1:ncol(df)){
          dist_v <- dist_v + sqrt((df[i,v] - df[j,v])^2)
        }
      }
      dist[i,j] <- dist_v
    }
  }
  return(dist)
}

data("iris")

tic <- Sys.time()
dst <- euclidean(iris[,1:4])
time <- difftime(Sys.time(), tic, units = "secs")[[1]]
print(paste("Time taken [Euclidean]: ", time))

tic <- Sys.time()
dst <- daisy(iris[,1:4])
time <- difftime(Sys.time(), tic, units = "secs")[[1]]
print(paste("Time taken [Daisy]: ", time))

Solution

  • one option:

     euclidean3 <- function(df) {
      require(data.table)
      n <- nrow(df)
      i <- CJ(1:n, 1:n) # generate all row combinations
      dl <- sapply(df, function(x) sqrt((x[i[[1]]] - x[i[[2]]])^2)) # loop over columns
      dv <- rowSums(dl) # sum values of columns
      d <- matrix(dv, n, n) # fill in matrix
      d
    }
    dst3 <- euclidean3(iris[,1:4])
    all.equal(euclidean(iris[,1:4]), dst3) # TRUE
    
    [1] "Time taken [Euclidean3]:  0.008"
    [1] "Time taken [Daisy]:  0.002"
    

    Largest bottleneck in your code is selecting data.frame elements in loop (df[j,v])). Maybe changing it to matrix also could improver speed. I believe there could be more performant approach on stackoverflow, you just need to search by correct keywords...