
improper use case of mclapply?


I've been playing with RStudio Server on an old laptop running Linux Mint (Debian-based).

I've always run R on Windows, so I've never taken advantage of the parallel or multicore packages. My goal was to learn RStudio Server and R on Linux, and to see how multicore processing could speed up my workflows.

One main use of lapply that I use on an everyday basis looks like this:

# 100 rows x 26 columns of simulated data
data <- data.frame(matrix(rnorm(2600, 85, 19), nrow = 100, ncol = 26))
names(data) <- letters

# Summarise one column: mean (sd) of each half, plus a t-test p-value
f <- function(x) {
  x1 <- data[1:50, x]
  x2 <- data[51:100, x]

  c(paste0(mean(x1), " (", sd(x1), ")"),
    paste0(mean(x2), " (", sd(x2), ")"),
    t.test(x1, x2)$p.value)
}

do.call(rbind, lapply(letters, f))

library(microbenchmark)
microbenchmark(
  do.call(rbind, lapply(letters, f))
)

Median time is 21.8 milliseconds

Alternatively:

library(parallel)
microbenchmark(
  do.call(rbind, mclapply(letters, f))
)

Median time is 120.9 milliseconds.

Why the huge difference?

The machine is a 2-core dinosaur. Is it that you don't see benefits until you're working with machines of 4 or more cores? Or is my use case (column-wise calculations on a data.frame) a poor fit for parallelism?
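(For context, the number of cores R can see, and the default worker count `mclapply` falls back to, can be checked directly; a small sketch assuming only the `parallel` package is installed:)

```r
library(parallel)

detectCores()              # cores visible to R on this machine
getOption("mc.cores", 2L)  # mclapply()'s default number of workers
```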

Thank you!


Solution

  • Your data is too small for the parallel speed-up to outweigh the forking overhead; try

    f <- function(x) {
      x1 <- data[1:50000, x]
      x2 <- data[50001:100000, x]

      c(paste0(mean(x1), " (", sd(x1), ")"),
        paste0(mean(x2), " (", sd(x2), ")"),
        t.test(x1, x2)$p.value)
    }

    # note: 100000 rows * 26 columns needs 2600000 values, not 2600
    data <- data.frame(matrix(rnorm(2600000, 85, 19), nrow = 100000, ncol = 26))
    

    instead and check the result. On my laptop your example takes median times of 7 and 17 milliseconds, but with the bigger data those numbers become 120 and 80. So in my opinion it's not (only) the number of cores that matters here, but rather the size of your data.
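To see the crossover directly, one can make each task expensive enough that the one-off cost of forking workers is amortised across real computation. A rough sketch along the lines of the answer above (the `heavy` function and the task count of 8 are arbitrary choices for illustration):

```r
library(parallel)

# A deliberately expensive per-task computation
heavy <- function(i) {
  x <- rnorm(5e5)
  c(mean = mean(x), sd = sd(x))
}

ser <- system.time(do.call(rbind, lapply(1:8, heavy)))["elapsed"]
par <- system.time(do.call(rbind, mclapply(1:8, heavy, mc.cores = 2)))["elapsed"]
ser; par  # on a 2-core Linux box, par should approach ser / 2
```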