Tags: r, concurrency, parallel-processing, multicore

Updating the same memory (matrix) in parallel computations?


I have a strong use case for parallelizing a flavor of the SGD algorithm. In this use case I need to update the matrices P and Q with the gradient update for a random batch of samples. Each process will update mutually exclusive indices of both matrices.

A simple illustration of what I intend to do would be something like this:

library(parallel)  # provides mclapply()

# create "big" matrix
A <- matrix(rnorm(10000), 100, 100)
system.time(
  # update each row vector independently using all my cores
  r <- mclapply(1:100, mc.cores = 6, function(i) {
    # updating ...
    A[i,] <- A[i,] - 0.01
    # return something, i.e. here I'd return the RMSE of this batch instead
    sqrt(sum(A[i,]^2))
  })
)

Are there any drawbacks to using this approach? Are there more R-idiomatic alternatives?

For example, to be clean (i.e. no side effects, immutable computation), returning the update A[i,] - 0.01 instead of the RMSE would be more complex to program, and it would raise peak memory usage or even run out of memory.


Solution
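
One drawback of the approach in the question is worth checking before anything else: the assignment `A[i,] <- ...` inside the function passed to `mclapply()` only changes a copy that is local to that function call (and, on Unix-like systems where `mclapply()` forks, local to the child process). The following is a minimal sketch of that check, not part of the original code; the names `A_before` and `n_cores` are introduced here for illustration, and the core count is an arbitrary choice.

```r
library(parallel)

set.seed(1)
A <- matrix(rnorm(10000), 100, 100)
A_before <- A  # snapshot to compare against afterwards

# mclapply() forks on Unix-like systems; Windows only supports mc.cores = 1
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

r <- mclapply(1:100, mc.cores = n_cores, function(i) {
  A[i, ] <- A[i, ] - 0.01  # modifies a copy local to this call
  sqrt(sum(A[i, ]^2))      # only this return value reaches the parent
})

identical(A, A_before)  # the parent's A was never modified
```

So only the returned RMSE values survive; the updated rows are discarded unless they are returned or written to memory that is actually shared between processes.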

  • Reimplementing your code, by block, using shared data with package {bigstatsr}:

    N <- 10e3
    A <- matrix(rnorm(N * N), N)
    
    library(bigstatsr)
    bigA <- as_FBM(A)
    
    library(doParallel)
    registerDoParallel(cl <- makeCluster(4))
    system.time(
      r <- foreach(i = seq_len(N), .combine = 'c') %dopar% {
        # updating ... 
        A[i,] <- A[i,] - 0.01
        # return something, i.e. here I'd return the RMSE of this batch instead   
        sqrt(sum(A[i,]^2))
      }
    ) # 11 sec
    stopCluster(cl)
    
    registerDoParallel(cl <- makeCluster(4))
    system.time(
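      r2 <- big_apply(bigA, function(X, ind) {
        # `ind` holds column indices by default; using it to index rows
        # only works here because bigA is square
        # updating ...
        tmp <- bigA[ind, ] <- bigA[ind, ] - 0.01
        # return something, i.e. here I'd return the RMSE of this batch instead
        sqrt(rowSums(tmp^2))
      }, a.combine = 'c')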
    ) # 1 sec
    stopCluster(cl)
    
    all.equal(r, r2) # TRUE
    

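    Note that it would be better to update columns instead of rows, because R matrices (and FBMs) are stored in column-major order, so each column is contiguous in memory.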
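
To illustrate the remark about updating columns instead of rows, here is a hedged sketch of a column-wise variant. It assumes the batches can be laid out as columns, which the question does not confirm; the smaller `N`, the result name `r3`, and `ncores = 2` are choices made here for illustration only.

```r
library(bigstatsr)

set.seed(1)
N <- 1000
A <- matrix(rnorm(N * N), N)
bigA <- as_FBM(A)  # file-backed copy of A that worker processes can attach to

r3 <- big_apply(bigA, function(X, ind) {
  # `ind` is a block of column indices (the default split of big_apply)
  tmp <- X[, ind] - 0.01
  X[, ind] <- tmp        # written to the backing file, so the update persists
  sqrt(colSums(tmp^2))   # per-column RMSE-like summary for this block
}, a.combine = 'c', ncores = 2)

all.equal(bigA[], A - 0.01)  # compare the FBM contents with the expected update
```

Here each block reads and writes whole columns, so the access pattern follows the storage order, and the function uses its `X` argument rather than relying on `bigA` from the enclosing environment.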