I am using R (on Windows 7, 32-bit) for text classification with randomForest. Because the dataset is large, I searched the Internet for ways to speed up model building and came across the randomForestSRC package.
I have followed all the steps in the package's installation manual, yet while the rfsrc() command runs, R uses only one of the logical cores (the same as with randomForest()), and CPU utilization peaks at 25%.
I have used the following command, as per the manual:
options(mc.cores = detectCores() - 1, rf.cores = detectCores() - 1)
I am using Windows 7 Professional 32-bit, Service Pack 1, on an Intel i3 2120 CPU with 4 logical cores. Could anyone shed some light on what I might be missing? Any other efficient way to use randomForest with multicore utilization would also be helpful!
The problem is that randomForestSRC uses the mclapply function for parallel execution, but mclapply doesn't support parallel execution on Windows. randomForestSRC can also use OpenMP for multithreaded parallel execution, but that isn't built into the binary distribution from CRAN, so you have to build the package from source with OpenMP support enabled.
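To illustrate the mclapply limitation, here is a minimal sketch that uses only the base parallel package. The helper par_apply is a hypothetical name, not part of any package, and the worker count of 2 is an arbitrary assumption. It falls back to a socket cluster on Windows, where mclapply cannot run with more than one core:

```r
library(parallel)

# Hypothetical helper (not from any package): apply FUN to each element of X
# in parallel. Forking via mclapply is used where the OS supports it; on
# Windows a PSOCK cluster is used instead.
par_apply <- function(X, FUN, workers = 2) {
  if (.Platform$OS.type == "windows") {
    cl <- makePSOCKcluster(workers)
    on.exit(stopCluster(cl))        # always shut the workers down
    parLapply(cl, X, FUN)
  } else {
    mclapply(X, FUN, mc.cores = workers)
  }
}

squares <- par_apply(1:4, function(i) i^2)
```

Socket workers start as fresh R sessions, so any packages or data that FUN needs have to be loaded or exported on them explicitly; forked workers inherit the parent session instead.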
I think your two options are:
1. Build randomForestSRC with OpenMP support on your Windows machine;
2. Run randomForest in parallel yourself, for example with foreach.
Here's a simple parallel example using the randomForest package with foreach and doParallel that is derived from an example in the foreach vignette:
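library(randomForest)
library(doParallel)

# Start one worker process per logical core and register it with foreach
workers <- detectCores()
cl <- makePSOCKcluster(workers)
registerDoParallel(cl)

x <- matrix(runif(500), 100)
y <- gl(2, 50)
ntree <- 1000

# Grow an equal share of the trees on each worker, then merge the forests
rf <- foreach(n=rep(ceiling(ntree/workers), workers),
              .combine=combine, .multicombine=TRUE,
              .packages='randomForest') %dopar% {
  randomForest(x, y, ntree=n)
}

# Shut the worker processes down when finished
stopCluster(cl)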
This example should work on Windows, Mac OS X and Linux. See the foreach vignette for more information.
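One thing to keep in mind with this approach: according to the randomForest documentation, a forest produced by combine no longer carries the out-of-bag error estimates (the err.rate and confusion components are NULL), so the model has to be evaluated on separate data. A small sequential sketch of that behaviour, with made-up data and arbitrary tree counts chosen only for illustration:

```r
library(randomForest)

set.seed(1)
x <- matrix(runif(500), 100)
y <- gl(2, 50)

# Two small forests grown separately, standing in for per-worker results
rf1 <- randomForest(x, y, ntree = 50)
rf2 <- randomForest(x, y, ntree = 50)

# Merge them into one forest; the tree counts add up
rf <- combine(rf1, rf2)
total_trees <- rf$ntree
oob_dropped <- is.null(rf$err.rate)

# The combined forest can still be used for prediction on new data
newx <- matrix(runif(50), 10)
pred <- predict(rf, newx)
```

If you need an error estimate, hold out a labelled test set and compare pred against its labels.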