Search code examples
rdoparallelparallel-foreachsnowbigstatsr

FBM() of bigstatsr does not calculate the matrix correctly while using parallel foreach as it does when the code is running in a simple for loop


I need to estimate a transition Matrix. Since I have lots of data I tried to run it parallel using foreach and I tried the shared-memory function FBM() of bigstatsr. And it seems that the function does not all the time return the right result. (Sometimes it does.) Could it be the case that the function doesn’t work properly?

Here is an example when the code works correctly:

x <- c(1,2,1,1,3,4,4,1,2,4,1,4,3,4,4,4,3,1,3,2,3,3,3,4,2,2,3)
n <- length(unique(x))
A <- matrix(nrow = n, ncol = n, 0)
for (t in 1:(length(x) - 1)) {A[x[t], x[t + 1]] <- A[x[t], x[t + 1]] + 1}
A

And here is the Code that does not always return the correct result:

library(foreach)
library(doParallel)
library(bigstatsr)

cl <- makeCluster(8)
registerDoParallel(cl)

B <- FBM(n, n)
set.seed(3)

foreach (t = 1:(length(x) - 1))  %dopar% {B[x[t], x[t + 1]] <- B[x[t], x[t + 1]] + 1}
stopCluster(cl)

B[]
identical(A,B[])

The same happens when using snow library

library(snow)
library(bigstatsr)
cl <- makeCluster(8)
f.trans.m <- function(t) {
  D[x[t], x[t + 1]] <<- D[x[t], x[t + 1]] + 1
}
D <- FBM(n, n)
clusterExport(cl, "f.trans.m")
clusterExport(cl, "D")
clusterExport(cl, "x")
parLapply(cl, seq(1,(length(x) - 1)), function(t) f.trans.m(t))
D[]
identical(A,D[])

Do I use the package correctly, or is there a bug in FBM()?

a solution:

A file lock was missing which is provided by the package flock.

B <- FBM(n, n)
lock <- tempfile()
foreach (t = 1:(length(x) - 1))  %dopar% {
  locked <- flock::lock(lock)
  B[x[t], x[t + 1]] <- B[x[t], x[t + 1]] + 1
  flock::unlock(locked)
}

Solution

  • For this particular example, the problem is with parallel concurrent update of values (cf. https://privefl.github.io/blog/a-guide-to-parallelism-in-r/#advanced-parallelism-synchronization).

    Here, I would not use parallelism at all. I would rather go for a sequential (but vectorized) accessor.

    I would first regroup the indices to increment:

    library(dplyr)
    ind <- data.frame(i = x[-length(x)], j = x[-1]) %>%
      group_by(i, j) %>%
      count()
    

    Then, I would use the two-column matrix accessor to update corresponding values without using an R loop.

    B <- FBM(n, n, init = 0)
    ind2 <- as.matrix(ind[1:2])
    B[ind2] <- B[ind2] + ind[[3]]