
Optimize a large number of variable operations and variable ordering


I would like some suggestions on speeding up the code below. The flow of the code is fairly straightforward. I'm running R version 4.0.3 on Windows.

  1. with combn, create a vector of unique combinations (m = 3, 4, or 5) of the df variable names (e.g., var1*var2*var3 ... var1*var2*varN)
  2. transform the vector of combinations into a list of formulas
  3. split the list of formulas into chunks to get around memory limitations (required to run step 4)
  4. iterate through each chunk from step 3 and perform the formula operation on the df, saving the values resulting from each operation in a separate list (ops_list_temp) for use in step 5
  5. for each element in ops_list_temp, find the indices of the largest n values based on the user-specified topn and save the results to indices_list
  6. for each element in indices_list, subset df by the indices in that element and store the corresponding entries of the value column in values_list

The full reprex is below, including different attempts using purrr::map and base lapply. I also attempted to use := from data.table following the link below, but I was unable to figure out how to transform the list of formulas into formulas that could be fed to quote(:=(...)):

Apply a list of formulas to R data.table
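
For context, something along these lines is what I was trying to achieve (a rough sketch on my part, not the linked answer's code; the str2lang() conversion and the cmb* column names are placeholders I made up, and I only show the first three combinations since materializing millions of product columns would blow through memory):

library(data.table)

dt <- as.data.table(df)

## turn "X1*X2*X3*X4" strings (ops_vec from STEP 1 below) into unevaluated R expressions
exprs <- lapply(ops_vec[1:3], str2lang)

## evaluate each expression inside the data.table, adding a column by reference
for (k in seq_along(exprs)) {
  dt[, (paste0("cmb", k)) := eval(exprs[[k]])]
}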

It appears to me that one of the bottlenecks in my code is the variable operations step (STEP 4). With m=4 and 90 variables, there are a total of 2,555,190 elements (RcppAlgos::comboCount(v = 90, m = 4, repetition = FALSE)). Breaking this up into chunks of 10,000 to get around memory limitations results in a list of 256 elements.

With m=5, there are 43,949,268 elements (RcppAlgos::comboCount(v = 90, m = 5, repetition = FALSE)) and a chunks list of ~4,395 elements.
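
These counts are quick to verify:

RcppAlgos::comboCount(v = 90, m = 4, repetition = FALSE) # 2,555,190
ceiling(2555190 / 10000)                                 # 256 chunks
RcppAlgos::comboCount(v = 90, m = 5, repetition = FALSE) # 43,949,268
ceiling(43949268 / 10000)                                # 4,395 chunks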

A previous bottleneck was in the ordering step, which I've managed to speed up quite a bit using the kit library and the link below, but any suggestions that could speed up the entire flow are appreciated. The example I'm posting here uses a combn of 4, as that is typically what I use in my workflow, but I would also like to be able to go up to a combn of 5 if the speed is reasonable.

Fastest way to find second (third...) highest/lowest value in vector or column
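
The gist of that speedup, in a rough sketch (timings are machine dependent): kit::topn() selects the top-n indices without sorting the full vector, unlike order().

x <- rnorm(8e4)
head(order(x, decreasing = TRUE), 250)  # full sort of all 80,000 values
kit::topn(x, 250, decreasing = TRUE)    # partial selection of the top 250 only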

library(purrr)
library(stringr)
library(kit)

df <- data.frame(matrix(data = rnorm(80000*90,200,500), nrow = 80000, ncol = 90))
df$value <- rnorm(80000,200,500)
cols <- names(df)
cols <- cols[!grepl("value", cols)]
combination <- 4

## STEP 1:
## create unique combinations of column names
ops_vec <- combn(cols, combination, FUN = paste, collapse = "*")

## STEP 2:
## transform ops vector into list of formulas
ops_vec_l <- purrr::map(ops_vec, .f = function(x) str_split(x, "\\*", simplify = T))

## STEP 3:
## break up the list of formulas into chunks otherwise memory error
chunks_run <- split(1:length(ops_vec_l), ceiling(seq_along(ops_vec_l)/10000))

## store results of each chunk into one final list
chunks_list <- vector("list", length = length(chunks_run))

## STEP 4:
ptm <- Sys.time()
chunks_idx <- 1
for (chunks_idx in seq_along(chunks_run))
{
  ## STEP 4 (cont):
  ## using purrr::map
  # p <- Sys.time()
  ele_length <- length(chunks_run[[chunks_idx]])
  ops_list_temp <- vector("list", length = ele_length)
  ops_list_temp <- purrr::map(
    ops_vec_l[ chunks_run[[chunks_idx]] ], .f = function(x) df[,x[,1]]*df[,x[,2]]*df[,x[,3]]*df[,x[,4]]
  )
  # (p <- Sys.time()-p)  #Time difference of ~ 3.6 secs to complete chunk of 10,000 operations
  
  # ## using base lapply
  # p <- Sys.time()
  # ele_length <- length( ops_vec_l[ chunks_run[[chunks_idx]] ])
  # ops_list_temp <- vector("list", length = ele_length)
  # ops_list_temp <- lapply(
  #   ops_vec_l[ chunks_run[[chunks_idx]] ], function(x) df[,x[,1]]*df[,x[,2]]*df[,x[,3]]*df[,x[,4]]
  # )
  # (p <- Sys.time()-p) #Time difference of ~3.7 secs to complete a chunk of 10,000 operations
  
  ## number of rows I want to subset from df
  topn <- 250
  
  ## list to store indices of topn values for each list element
  indices_list <- vector("list", length = length(ops_list_temp))
  
  ## list to store value of the topn indices for each list element
  values_list <- vector("list", length = length(ops_list_temp))
  
  ## STEP 5:
  ## for each variable combination in "ops_list_temp" list, find the index (indices) of the topn values in decreasing order
  ## each element in this list should be the length of topn
  indices_list <- purrr::map(ops_list_temp, .f = function(x) kit::topn(vec = x, n = topn, decreasing = T, hasna = F))
  
  ## STEP 6:
  ## after finding the indices of the topn values for a given variable combination, find the value(s) corresponding to index (indices) and store in the list
  ## each element in this list, should be the length of topn
  values_list <- purrr::map(indices_list, .f = function(x) df[x,"value"])
  
  ## save completed chunk to final list
  chunks_list[[chunks_idx]] <- values_list
}
(ptm <- Sys.time()-ptm) # Time difference of 41.1 mins

Solution

    1. When you are memory constrained, you need to avoid intermediate assignments of large objects.
    2. In this case there is no reason to iterate over names instead of integer indices.
    3. In step 1 you paste() your values together only to split them again in step 2. Why? (See the miniature example after this list.)
    4. When you need more speed, parallelization can be a way to go. Your problem is highly parallelizable, but parallelization also increases memory usage, so your mileage may vary.
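
    Point 3 in miniature: combn() can return the combinations as a list directly, so the paste()/str_split() round trip is unnecessary:

    combn(1:5, 3, simplify = FALSE)[1:2] # list of c(1, 2, 3) and c(1, 2, 4)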

    In the piece of code below, I took your problem and applied these lessons. On my fairly old i5-6267U dual-core processor, it took 8 seconds to run the parallelized future_map() on 10,000 combinations. That is equivalent to one iteration of your for loop, which takes 46 seconds on my machine, so this yields an approximate speedup of 6x. Since your loop was not parallelized, you may see an even greater increase if you have a more modern processor with more cores. The preparatory steps before that are also much faster.

    library(stringr)
    library(kit)
    library(furrr)
    
    plan(multisession)
    
    # Parameters
    tpn <- 250 # topn parameter: number of rows to subset
    combination <- 4
    
    # Data
    df <- data.frame(matrix(data = rnorm(80000*90,200,500), nrow = 80000, ncol = 90))
    df$value <- rnorm(80000,200,500)
    
    # Calculations
    cols <- which(names(df) != "value") # indices for all columns but `value`
    cbn <- combn(cols, combination, simplify = F) # combinations
    
    
    result <- cbn |> 
      future_map(\(cb) df[, cb] |> # select the respective columns 
              Reduce(f = `*`) |> # rowwise product
              kit::topn(tpn) |>
              (\(x) df[x, "value"])() # select corresponding values
            )
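
    In case the Reduce() step is unfamiliar: a data.frame is a list of columns, so Reduce(`*`, ...) multiplies the selected columns element-wise, yielding the rowwise product. In miniature:

    d <- data.frame(a = 1:3, b = 4:6, c = 7:9)
    Reduce(`*`, d) # (a * b) * c -> 28 80 162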
    
    

    Edit

    On R 4.0, which lacks the base pipe |> and the \(x) lambda shorthand introduced in R 4.1, you can use this:

    library(purrr)
    result <- cbn %>% 
      future_map(function(cb) df[, cb] %>% 
                   Reduce(f = `*`) %>%
                   kit::topn(tpn, hasna = F) %>%
                   `[`(df, ., "value")
                 )
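
    The backtick form on the last line is just subsetting written as a prefix call: with the explicit dot placeholder, `[`(df, ., "value") expands to df[., "value"]. For example:

    `[`(mtcars, 1:3, "mpg") # identical to mtcars[1:3, "mpg"]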
    

    If the multithreading still does not work, replace future_map() with map() to run it sequentially. Also, when you are testing, you may want to restrict the data to a subset like cbn[seq_len(1e4)] so that you don't have to wait around for the whole thing to finish.
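
    For example, a quick sequential smoke test on the first 100 combinations, reusing the objects defined above (written without the native pipe so it also runs on R 4.0):

    plan(sequential) # or simply replace future_map() with purrr::map()
    test <- future_map(cbn[seq_len(100)], function(cb) {
      v <- Reduce(`*`, df[, cb]) # rowwise product of the selected columns
      df[kit::topn(v, tpn, hasna = F), "value"]
    })
    lengths(test) # each of the 100 elements has length tpn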