I've been playing with RStudio Server on an old laptop running Linux Mint (Debian-based). I've always run R on Windows, so I've never taken advantage of the parallel or multicore packages. My goal was to learn RStudio Server and R on Linux, and to see how multicore processing could speed up my workflows.
A typical everyday use of lapply for me looks like this:
f <- function(x) {
  x1 <- data[1:50, x]
  x2 <- data[51:100, x]
  line <- c(paste0(mean(x1), " (", sd(x1), ")"),
            paste0(mean(x2), " (", sd(x2), ")"),
            t.test(x1, x2)$p.value)
  return(line)
}
data <- data.frame(matrix(rnorm(2600, 85, 19), nrow=100, ncol=26))
names(data) <- letters
do.call(rbind, lapply(letters, f))
library(microbenchmark)
microbenchmark(
  do.call(rbind, lapply(letters, f))
)

Median time is 21.8 milliseconds.
Alternatively:
library(parallel)
microbenchmark(
  do.call(rbind, mclapply(letters, f))
)

Median time is 120.9 milliseconds.
Why this huge difference?
The machine is a 2-core dinosaur. Is it that you don't see benefits until you're working with machines that have 4 or more cores? Or is my use case (column-wise calculations on a data.frame) simply not suited to seeing benefits?
Thank you!
Your data is too small to gain an advantage over the parallelization overhead. Try
f <- function(x) {
  x1 <- data[1:50000, x]
  x2 <- data[50001:100000, x]
  line <- c(paste0(mean(x1), " (", sd(x1), ")"),
            paste0(mean(x2), " (", sd(x2), ")"),
            t.test(x1, x2)$p.value)
  return(line)
}
data <- data.frame(matrix(rnorm(100000 * 26, 85, 19), nrow=100000, ncol=26))
names(data) <- letters
instead and check the result. Your original example took 7 and 17 milliseconds (median) on my laptop, but the bigger example changes this to 120 and 80. So in my opinion it's not (only) the number of cores, but rather the size of your data in this case: with only 100 rows per column, the cost of spinning up the parallel workers and collecting their results outweighs the computation itself.
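
If you want to compare the two approaches more directly, here is a minimal sketch, assuming the larger data and f from above are already defined; the labels serial/forked, mc.cores = 2, and times = 20 are just illustrative choices, not part of the original question:

library(parallel)
library(microbenchmark)

# Compare the plain lapply with the forked mclapply on the same data.
# mc.cores = 2 matches a 2-core machine; detectCores() reports what is available.
microbenchmark(
  serial = do.call(rbind, lapply(letters, f)),
  forked = do.call(rbind, mclapply(letters, f, mc.cores = 2)),
  times = 20
)

As the per-column work grows, the forked version should pull ahead, because the fixed cost of forking and collecting results is paid only once per call.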