Search code examples
rperformancefor-loopvectorization

Optimize for loops searching for closest points


I'm trying to optimize a for loop that takes way too long. I'm sure it can be optimized, but as I'm fairly new to R, I'm not sure how to do it.

I have two matrices, tarU_x and src_x. For each row in tarU_x, I want to find the closest one in src_x and assign the same label (I have the labels for src_x in src_y, and the estimated labels for tarU_x will be in tarU_y).

I'm doing it with classic nested for loops, which are not very efficient, so I would like to take advantage of the possibilities that R gives. The code is the following:

# Estimate tarU_y
tarU_y <- vector()
for (i in 1:nrow(tarU_x)) {
  tarU_vector <- tarU_x[i,]
  lowest_dist <- Inf
  lowest_dist_class <- -1
  for (j in 1:nrow(src_x)) {
    distance <- dist(rbind(tarU_vector, src_x[j,]))
    if (distance < lowest_dist) {
      lowest_dist <- distance
      lowest_dist_class <- src_y[j]
    }
  }
  tarU_y[i] <- lowest_dist_class
}

EDIT

I tried using apply, as s__ suggested, and got it to work, ending up with this code:

distances <- apply(src_x, 1, function (y) apply(tarU_x, 1, function(x) dist(rbind(x,y))))
tarU_y <- apply(distances, 1, function(x) src_y[which.min(x)])

But it seems to be a bit slower, I guess the underlying code is pretty similar. For example, a test with the for loops took 14 secondds, while using apply took 16 seconds.

For more info, the data I'm using is the one provided here: https://archive.ics.uci.edu/ml/datasets/Gas+Sensor+Array+Drift+Dataset+at+Different+Concentrations, which comes in 10 different batches, and every sample has 128 features.


Solution

  • Try the distmat function in library(pracma):

    library(pracma)
    tarU_y <- src_y[max.col(-distmat(tarU_x, src_x))]
    

    EDIT: benchmark added

    Illustrated using matrices of random normals:

    library(pracma)
    library(microbenchmark)
    
    set.seed(123)
    tarU_x <- matrix(rnorm(1e4, mean = rep(1:100, 10)), nrow = 100L)
    src_x <- matrix(rnorm(2e4, mean = rep(200:1, 10)/2), nrow = 200L)
    src_y <- rep(200:1, 10)/2
    
    using.forloop <- function(x1, x2, y1) {
      y2 <- rep(y1[1], nrow(x2))
      for (i in 1:nrow(x2)) {
        lowest_dist <- dist(rbind(x1[1,], x2[i,]))
        for (j in 2:nrow(x1)) {
          distance <- dist(rbind(x1[j,], x2[i,]))
          if (distance < lowest_dist) {
            lowest_dist <- distance
            y2[i] <- y1[j]
          }
        }
      }
      return(y2)
    }
    
    using.distmat <- function(x1, x2, y1) {
      return(y1[max.col(-distmat(x2, x1))])
    }
    
    all.equal(using.forloop(src_x, tarU_x, src_y), using.distmat(src_x, tarU_x, src_y))
    [1] TRUE
    
    microbenchmark(using.forloop(src_x, tarU_x, src_y), using.distmat(src_x, tarU_x, src_y))
    
    Unit: milliseconds
                                    expr      min        lq       mean   median        uq      max neval
     using.forloop(src_x, tarU_x, src_y) 415.8176 447.95200 473.345159 462.0715 495.33775 609.8592   100
     using.distmat(src_x, tarU_x, src_y)   2.4413   2.59575   2.779786   2.7072   2.91965   3.8540   100