Tags: r, parallel-processing, plyr, mc

multicore with plyr, MC


Hi, I am trying to use ddply from the plyr library in R together with the doMC package. It doesn't seem to be speeding up the computation. This is the code I run:

require(plyr)
require(doMC)
registerDoMC(4)
getDoParWorkers()
##> 4
test <- data.frame(x=1:10000, y=rep(c(1:20), 500))
system.time(ddply(test, "y", mean))
  # user  system elapsed 
  # 0.015   0.000   0.015
system.time(ddply(test, "y", mean, .parallel=TRUE))
  # user  system elapsed 
  # 223.062   2.825   1.093 

Any ideas?


Solution

  • The mean function operates too quickly relative to the communication costs required to distribute the split sections to each core and retrieve the results.

    This is a common "problem" people run into with distributed computing. They expect it to make everything run faster because they forget there are costs (communication between the nodes) as well as benefits (using multiple cores).

    Something specific to parallel processing in plyr: only the function is run on multiple cores. The splitting and combining are still done on a single core, so the function you're applying would have to be very computationally intensive to see a benefit when using plyr functions in parallel (see the sketch below).
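
    To illustrate, here is a minimal sketch where the per-group work is made artificially expensive (the slow_mean function and the Sys.sleep delay are invented for illustration, not taken from the question). With work this heavy per group, .parallel = TRUE should start to pay off:

    require(plyr)
    require(doMC)
    registerDoMC(4)

    test <- data.frame(x = 1:10000, y = rep(1:20, 500))

    # deliberately expensive per-group function; Sys.sleep stands in for real work
    slow_mean <- function(df) {
      Sys.sleep(0.25)
      data.frame(mean_x = mean(df$x))
    }

    system.time(ddply(test, "y", slow_mean))
      # sequential: roughly 20 groups * 0.25 s ≈ 5 s elapsed
    system.time(ddply(test, "y", slow_mean, .parallel = TRUE))
      # with 4 workers, elapsed time should drop to roughly a quarter of that

    The point is not the exact numbers (they depend on your machine) but the ratio: once the per-group computation dwarfs the cost of shipping each split to a worker and collecting the result, the parallel version wins; with something as cheap as mean, it never will.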