Tags: r, foreach, time-series, lapply, doparallel

Optimizing a foreach with an embedded lapply loop - Is it possible to optimize the code?


I have the chunk of code below, which runs properly but is too time- and resource-consuming, especially when I run it on the full dataset. I was wondering if there is any way to optimize it and reduce the required processing time and resources while keeping the overlap and window_size values the same.

library(foreach)
library(doParallel)

cores <- parallel::detectCores()
cl <- makeCluster(cores - 1)
registerDoParallel(cl)

overlap <- 32
window_size <- 512

start_pos <- seq(1, 20480, by = overlap)

lst_B2_512_32 <- list()

## Edit: changed the fixed 2156 length of the loop to length(unique(test_1st_binded_B2$ID))
lst_B2_512_32 <- foreach(i=1:length(unique(test_1st_binded_B2$ID)), .verbose = T) %dopar% {
  lst_B2_512_32[[i]] <- lapply(start_pos, function(x) test_1st_binded_B2[test_1st_binded_B2$ID==i,][(c(x:(x + window_size-1))),])
}
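A side note on the loop's structure: foreach already collects each iteration's result into a list, so pre-allocating lst_B2_512_32 and assigning into it inside %dopar% does not change the returned value (each parallel worker modifies only its own copy). A lightly cleaned-up version of the same loop, which also subsets each ID once instead of once per window and releases the cluster afterwards, might look like this (untested sketch, same logic):

ids <- unique(test_1st_binded_B2$ID)
lst_B2_512_32 <- foreach(i = seq_along(ids), .verbose = TRUE) %dopar% {
  # subset this ID's rows once, then slice all windows out of the subset
  chunk <- test_1st_binded_B2[test_1st_binded_B2$ID == ids[i], ]
  lapply(start_pos, function(x) chunk[x:(x + window_size - 1), ])
}
stopCluster(cl)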

Here is the result of dput(head(test_1st_binded_B2)):

structure(list(B2_X = c(-0.183, -0.164, -0.195, -0.159, -0.261, 
-0.281), B2_Y = c(-0.054, -0.183, -0.125, -0.178, -0.098, -0.125
), ID = c(1L, 1L, 1L, 1L, 1L, 1L)), row.names = c(NA, 6L), class = "data.frame")

*Edit - Instead of using the dput results, this snippet of code reproduces the "test_1st_binded_B2" data.frame and makes it easy to scale by changing the value of n.

n <- 2
B2x <- rnorm(20480*n, mean = 0, sd = 1)
B2y <- rnorm(20480*n, mean = 0, sd = 1)
id <- rep(1:n, each = 20480)  # 20480 consecutive rows per ID, as in the dput above (1:n alone would recycle and interleave the IDs)

test_1st_binded_B2 <- data.frame(B2_X = B2x, B2_Y = B2y, ID = id)
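A quick structural check (for n = 2) confirms the generated frame matches the layout of the real data:

str(test_1st_binded_B2)       # 'data.frame': 40960 obs. of 3 variables
table(test_1st_binded_B2$ID)  # 20480 rows per ID, as in the original data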

The original test_1st_binded_B2 contains 44,154,880 observations in total: 20,480 observations for each of the 2,156 IDs. It is part of a bearing run-to-failure simulation dataset.


Solution

  • There are a couple of approaches that may help. For both, I would recommend staying away from foreach for now unless there is a more concrete use case.

    library(data.table)
    dt = as.data.table(test_1st_binded_B2)
    dt[, .(lapply(start_pos, function(i) .SD[i:(i + window_size - 1L)])), by = ID]
    

    Using data.table, we can take advantage of its highly optimized grouping. In the original example, each group was extracted by subsetting: any time we have a grouping problem, DF[grp == this_grp, ] means we have to re-compare the entire grouping column for every group. Grouping with by = instead lets data.table split the data once and apply our function to each group far more efficiently, as the sketch below illustrates.
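    To make the difference concrete, here is a minimal sketch of the two access patterns (same objects as above):

    # Repeated subsetting: the == comparison rescans the full ID column
    # once per group
    for (this_id in unique(test_1st_binded_B2$ID)) {
      sub <- test_1st_binded_B2[test_1st_binded_B2$ID == this_id, ]
      # ... work on sub ...
    }

    # Grouped execution: data.table splits by ID once and hands each
    # group to j as .SD
    dt[, nrow(.SD), by = ID]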

    Going further, lapply(start_pos, function(i) DF[i:(i + window_size - 1), ]) can be somewhat inefficient because we build up a large list of data frames. We could instead directly subset the grouped data:

    dt[, .SD[sequence(rep.int(window_size, length(start_pos)), start_pos)
             ][, list(list(.SD)), by = gl(length(start_pos), window_size)]
             , by = ID]
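    (Note: sequence() with a from argument requires R >= 4.0.0.) If the two results are assigned to, say, res1 and res2 (hypothetical names), both store one 512-row window per row in a list column V1, so the outputs can be spot-checked against each other:

    # assumes res1 and res2 hold the first and second option's results
    all.equal(res1$V1[[1]], res2$V1[[1]])  # should be TRUE: first window of ID 1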
    

    System timings are:

    OP - 1.59 seconds

    First option - 0.32 seconds

    Second option - 0.08 seconds
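    A sketch of a harness for reproducing this kind of comparison on the generated data above (absolute timings will vary with hardware and with n):

    dt <- as.data.table(test_1st_binded_B2)
    system.time(
      res1 <- dt[, .(lapply(start_pos, function(i) .SD[i:(i + window_size - 1L)])), by = ID]
    )
    system.time(
      res2 <- dt[, .SD[sequence(rep.int(window_size, length(start_pos)), start_pos)
               ][, list(list(.SD)), by = gl(length(start_pos), window_size)]
               , by = ID]
    )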