Tags: r, parallel-processing, cross-validation, gbm

Building parallel GBM models using cross-validation in R


The gbm package in R has a handy feature for parallelizing cross-validation by sending each fold to its own node. I would like to build multiple cross-validated GBM models over a range of hyperparameters. Ideally, because I have multiple cores, I could also parallelize the building of these models. With 12 cores, I could, in theory, have 4 models building simultaneously, each using 3-fold cross-validation. Something like this:

library(gbm)        # provides gbm() and gbm.roc.area()
library(doParallel) # foreach backend; also attaches foreach and parallel

tuneGrid <- expand.grid(
        n_trees = 5000, 
        shrink = c(.0001),
        i.depth = seq(10,25,5),
        minobs = 100,
        distro = c(0,1) #0 = bernoulli, 1 = adaboost
        )
cl <- makeCluster(4, outfile="GBMlistening.txt") #outfile captures worker output
registerDoParallel(cl) #4 parent workers to run in parallel
err.vect <- NA #initialize
system.time(
err.vect <- foreach (j=1:nrow(tuneGrid), .packages=c('gbm'),.combine=rbind) %dopar% {
        fit <- gbm(Label~., data=training, 
            n.trees = tuneGrid[j, 'n_trees'], 
            shrinkage = tuneGrid[j, 'shrink'],
            interaction.depth=tuneGrid[j, 'i.depth'], 
            n.minobsinnode = tuneGrid[j, 'minobs'], 
            distribution=ifelse(tuneGrid[j, 'distro']==0, "bernoulli", "adaboost"),
            w=weights$Weight,
            bag.fraction=0.5,
            cv.folds=3,
            n.cores = 3) #will this make 4X3=12 workers?
        cv.test <- data.frame(scores=1/(1 + exp(-fit$cv.fitted)), Weight=training$Weight, Label=training$Label)
        print(j) #write out to the listener
        cbind(gbm.roc.area(cv.test$Label, cv.test$scores), getAMS(cv.test), tuneGrid[j, 'n_trees'], tuneGrid[j, 'shrink'], tuneGrid[j, 'i.depth'],tuneGrid[j, 'minobs'], tuneGrid[j, 'distro'], j )
}
)
stopCluster(cl) #clean up after ourselves

I would use the caret package; however, I have hyperparameters beyond those in caret's default gbm tuning grid, and I would prefer not to build my own custom model in caret at this time. I am on a Windows machine, which I know affects which parallel back-end gets used.

If I do this, will each of the 4 clusters I start up spawn off 3 workers each, for a total of 12 workers chugging away? Or will I only have 4 cores working at once?


Solution

  • I believe this will do what you want. The foreach loop will run four instances of gbm at a time, and each of those will create its own three-node cluster via makeCluster. So you'll actually have 16 worker processes, but only 12 of them will be performing serious computation at any one time. You have to be careful with nested parallelism, but I think this will work.
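
    To see the nesting for yourself, here is a minimal sketch (not the original poster's code) that mimics the structure: four foreach workers, each spawning its own three-node inner cluster, as gbm does internally when cv.folds and n.cores are set. Recording Sys.getpid() at both levels shows the 4 outer plus 12 inner processes:

    library(parallel)
    library(doParallel)
    library(foreach)

    cl <- makeCluster(4)
    registerDoParallel(cl)

    pids <- foreach(j = 1:4, .combine = rbind, .packages = "parallel") %dopar% {
        # each foreach worker spawns a 3-node cluster, mirroring gbm's CV parallelism
        inner <- makeCluster(3)
        inner.pids <- unlist(parLapply(inner, 1:3, function(i) Sys.getpid()))
        stopCluster(inner)
        c(outer = Sys.getpid(), inner.pids)
    }
    stopCluster(cl)
    pids # 4 rows: one outer-worker PID followed by the PIDs of its 3 inner nodes

    If oversubscription becomes a problem, shrink either the outer cluster size or gbm's n.cores so that (outer workers) x (n.cores) stays at or below your physical core count.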