r, distance, doparallel

Inconsistent dist() foreach results


I have data roughly in the following format. The real dataset is very large and is broken up into groups by the class and uniqueId variables. Each row is one location, given as an (x, y) pair.

df <- 
  data.frame(
    x = c(1, 2, 3, 4, 5, 6, 8, 9, 10), 
    y = c(1, 2, 3, 4, 5, 6, 8, 9, 10), 
    class = c(0, 0, 0, 0, 0, 1, 0, 1, 0), 
    uniqueId = c("1-2-3", "1-2-3", "1-2-3", "1-2-4", "1-2-4", "1-2-4", "1-3-2", "1-3-2", "1-3-2"),
    partialId = c("1.2", "1.2", "1.2", "1.2", "1.2", "1.2", "1.3", "1.3", "1.3")
  )

The function I am using should go through the data frame and, for each row, calculate the smallest distance to another object with the same uniqueId but a different class. To do this, I've broken my data into chunks in the following way.

library(dplyr)  # provides %>%, select() and filter()

indexes <-
  df %>%
  select(partialId) %>%
  unique()

j <- 1

library(doParallel)

class_separation <- c()

cl <- makePSOCKcluster(24)

registerDoParallel(cl)

starttime <- Sys.time()  # needed for the stopwatch computed after the loop


while(j <= nrow(indexes)) {

  test <- df %>% filter(partialId == indexes$partialId[j])
  n <- nrow(test)
  vec <- numeric(n)
  vec <- foreach(k = 1:n, .combine = 'c', .multicombine = F) %dopar% {
    c(
      min(
        apply(
          test[test$uniqueId == test$uniqueId[k] & test$class != test$class[k], c("x","y")],
          1,
          function(x) dist(rbind(c(test$x[k],test$y[k]), c(x[1], x[2])))
        )
      )
    )
  }
  class_separation <- c(class_separation, vec)
  j <- j + 1
}
endtime <- Sys.time()
stopwatch <- endtime - starttime
stopCluster(cl)  # shut the workers down cleanly instead of closeAllConnections()
registerDoSEQ()
gc()
df <- cbind(df, class_separation)

When handling single plays or small batches, this code seems to work fine. However, when handling the full dataset I get results that are obviously incorrect. The flaw must be in how I am calculating these distances, since it is very unlikely that dist() itself or %dopar% is at fault. Switching to %do% does not change my results.

As an example of the discrepancy, the following image shows the class_separation column from the full data run next to the same column from a small example. The results are wildly different, and I'm not sure why.

results image


Solution

  • After a day of thinking about this, I found that the problem was in how I was passing my df to dist().

    For example, if we intended to pass

    dist(rbind(c(1, 1), c(6, 6)))
    
    dist(rbind(c(1, 1), c(9, 9)))
    

    What we actually pass is dist(rbind(c(1, 1), c(6, 6, 9, 9)))
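To make the difference concrete, here is a small base-R sketch. It assumes the call really did receive the concatenated vector as described above; the variable names (p, candidates, bad, good) are only for illustration. rbind() recycles the shorter vector, so the combined call measures a single distance in four dimensions instead of two distances in the plane.

```r
p <- c(1, 1)

# Combined call: rbind() recycles p to c(1, 1, 1, 1), so dist() returns
# one 4-dimensional distance rather than two 2-dimensional ones.
bad <- as.numeric(dist(rbind(p, c(6, 6, 9, 9))))
print(bad)    # sqrt(178), about 13.34166

# Intended computation: one distance per candidate point, then the minimum.
candidates <- rbind(c(6, 6), c(9, 9))
each <- apply(candidates, 1, function(q) as.numeric(dist(rbind(p, q))))
good <- min(each)
print(each)   # about 7.071068 and 11.313708
print(good)   # about 7.071068
```

Note that rbind() gives no warning here, because 4 is a multiple of 2, which is part of why this kind of mistake is easy to miss.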

    This is obviously not what I want. I needed both distances, and then to select the minimum of the two (or add other conditions). The way I found to do this was with the rdist package.

    foreach(i = 1:nrow(df), .combine = 'c', .multicombine = F, .packages = c('tidyverse', 
      'rdist')) %dopar% {
        min(
          cdist(
            df[df$class != df$class[i] & df$uniqueId == df$uniqueId[i], ] %>% select(x, y), 
            df %>% select(x, y) %>% slice(i)
          )
        )
    }
    

    For our test data this returns the vector

    Inf Inf Inf 2.828427 1.414214 1.414214 1.414214 1.414214 1.414214

    Which is exactly what I needed. The first three entries have no class == 1 options for their uniqueId, so they return Inf. Row 4 is twice as far from row 6 as row 5 is, with all three sharing the same uniqueId, and row 8 (the point (9, 9)) is equidistant from rows 7 and 9. I still need to test whether this solution is fast enough.
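If speed turns out to be a concern, one possible alternative is sketched below in base R only, without rdist or foreach. This is not the method used in the answer above: the helper name min_other_class is hypothetical, and the approach assumes each uniqueId group is small enough that its full pairwise distance matrix fits in memory. For each group it builds the distance matrix once, masks same-class pairs with Inf, and takes the row-wise minimum.

```r
# Test data from the question (partialId is not needed here).
df <- data.frame(
  x = c(1, 2, 3, 4, 5, 6, 8, 9, 10),
  y = c(1, 2, 3, 4, 5, 6, 8, 9, 10),
  class = c(0, 0, 0, 0, 0, 1, 0, 1, 0),
  uniqueId = c("1-2-3", "1-2-3", "1-2-3", "1-2-4", "1-2-4", "1-2-4",
               "1-3-2", "1-3-2", "1-3-2")
)

# Hypothetical helper: nearest different-class distance within one group.
min_other_class <- function(x, y, class) {
  d <- as.matrix(dist(cbind(x, y)))     # all pairwise Euclidean distances
  d[outer(class, class, "==")] <- Inf   # mask same-class pairs, including self
  unname(apply(d, 1, min))              # Inf when no other class exists
}

# Fill the result group by group, keeping the original row order.
class_separation <- numeric(nrow(df))
for (rows in split(seq_len(nrow(df)), df$uniqueId)) {
  class_separation[rows] <-
    min_other_class(df$x[rows], df$y[rows], df$class[rows])
}

print(class_separation)
```

On the test data this should reproduce the vector shown above (Inf for the first three rows, then 2.828427 followed by 1.414214 five times). The trade-off is memory: the distance matrix grows with the square of the group size, so very large groups may still call for a chunked or cdist-based approach.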