Tags: r, parallel-processing, bioconductor

Parallel processing with BiocParallel running much longer than serial


I am trying to use parallel processing to speed up running many Boosted Regression Trees (BRTs) in R, using the BiocParallel package (http://lcolladotor.github.io/2016/03/07/BiocParallel/#.WiqF7bQ-e3c). I have created some dummy data and set up a function that runs two BRT models, which I hoped to time in serial and then in parallel. However, my parallel run never seems to complete, while my serial run takes only about 3 seconds.

##CAN I USE PARALLEL PROCESSING TO SPEED UP BRT'S?

##LOAD PACKAGES
library(BiocParallel)
library(dismo)
library(gbm)
library(MASS)

##CREATE RANDOM, CORRELATED DATA
## FROM https://www.r-bloggers.com/simulating-random-multivariate-correlated-data-continuous-variables/
R = matrix(cbind(1,.80,.2,  .80,1,.7,  .2,.7,1),nrow=3)
U = t(chol(R))
nvars = dim(U)[1]
numobs = 100
set.seed(1)
random.normal = matrix(rnorm(nvars*numobs,0,1), nrow=nvars, ncol=numobs);
X = U %*% random.normal
newX = t(X)
raw = as.data.frame(newX)
orig.raw = as.data.frame(t(random.normal))
names(raw) = c("response","predictor1","predictor2")
cor(raw)


###########################################################
##  MODEL
##########################################################


##WITH FUNCTIONS, 

Tc<-c(4, 8) ##Tree Complexities

Lr<-c(0.01)  ## Learning Rates

Vars <- split(expand.grid(Tc,Lr),seq(nrow(expand.grid(Tc,Lr))))

brt <- function(x){
  a <- gbm.step(raw,gbm.x=c(2:3),gbm.y="response",tree.complexity=x[1],learning.rate=x[2],bag.fraction=0.65, family="gaussian")
  b <- data.frame(model=paste("Tc= ",x[1]," _ ","Lr= ",x[2],sep=""), R2=a$cv.statistics$correlation.mean, Dev=a$cv.statistics$deviance.mean)
  ##Reassign model with unique name
  assign(paste("patch.tc",x[1],".lr",x[2],sep=""),a, envir = .GlobalEnv)
  assign(paste("RESULTS","patch.tc",x[1],".lr",x[2],sep=""),b, envir = .GlobalEnv)
  print(b)
}



############################
###IN Serial
############################

system.time(
lapply(Vars, brt)
)


############################
###IN PARALLEL
############################

system.time(
bplapply(Vars, brt)
)

Solution

  • Some quick comments:

    1. Always avoid assign(); if you find yourself using it, it's a good sign you're approaching the problem the wrong way.

    2. Assigning variables to the global environment from within a function (using assign() or <<-) is always a bad idea and, again, a hint that there is a better solution you should be using.

    3. If you still choose to break 1 and 2 above, it will certainly not work when you use parallel processing.

    4. Instead, return your values (see below).

    5. The dismo::gbm.step() function tries to plot by default (plot.main = TRUE). That will not work (it is actually invalid) in so-called forked parallel processing, which is often the default on Unix and macOS (see the backend sketch after this list).

    6. Plotting in parallel is often not what you want to do anyway (unless you plot to an image file or similar).
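
    If you want to stay with BiocParallel but avoid the forked default, one option is to register a SOCK-cluster backend explicitly before calling bplapply(). A minimal sketch, where the worker count of 2 is just an example:

    library(BiocParallel)
    ## Use separate R processes (SOCK cluster) instead of the forked
    ## MulticoreParam default on Unix/macOS; 2 workers is only an example.
    register(SnowParam(workers = 2))
    ## Subsequent bplapply() calls pick up this backend automatically.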

    To your problem: after modifying your brt() as follows (according to 1-6):

    brt <- function(x){
      a <- gbm.step(raw, gbm.x=c(2:3), gbm.y="response", tree.complexity=x[1], learning.rate=x[2], bag.fraction=0.65, family="gaussian", plot.main = FALSE)
      b <- data.frame(model=paste("Tc= ", x[1], " _ ", "Lr= ", x[2], sep=""), R2=a$cv.statistics$correlation.mean, Dev=a$cv.statistics$deviance.mean)
      list(a = a, b = b)
    }
    

    it works for me with bplapply(Vars, brt) as well as with future::future_lapply(Vars, brt). With parallel::parLapply(cl, Vars, brt) you need to take more care about exporting globals.
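
    As an illustration of that last point, a minimal sketch with a PSOCK cluster; the specific globals exported here (the raw data frame plus loading dismo on each worker) are assumptions based on what brt() uses:

    library(parallel)
    cl <- makeCluster(2)                  ## 2 workers is just an example
    clusterEvalQ(cl, library(dismo))      ## gbm.step() comes from dismo
    clusterExport(cl, "raw")              ## brt() reads raw from the global environment
    res <- parLapply(cl, Vars, brt)
    stopCluster(cl)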

    PS. I would probably just return a and extract the b info outside.
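
    For example, a minimal sketch of that variant (the summarising step after the parallel call is just one way to combine the results):

    brt <- function(x){
      ## Only fit and return the model; summarise afterwards.
      gbm.step(raw, gbm.x=c(2:3), gbm.y="response", tree.complexity=x[1],
               learning.rate=x[2], bag.fraction=0.65, family="gaussian",
               plot.main=FALSE)
    }

    fits <- bplapply(Vars, brt)

    ## Build the results table outside the parallel step.
    results <- do.call(rbind, Map(function(a, x) {
      data.frame(model=paste("Tc= ", x[1], " _ ", "Lr= ", x[2], sep=""),
                 R2=a$cv.statistics$correlation.mean,
                 Dev=a$cv.statistics$deviance.mean)
    }, fits, Vars))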