I would like some suggestions on speeding up the code below. The flow of the code is fairly straightforward. I'm running R version 4.0.3 on Windows.
1. Using combn, create a vector of unique combinations (m = 3, 4, or 5) from the df variable names (i.e., var1*var2*var3 ... var1*var2*varN).
2. Transform the vector of combinations into a list of formulas.
3. Break up the list of formulas into chunks to avoid memory errors.
4. Iterate through each chunk and perform the operation on df. Save the values resulting from the operations step in a separate list (ops_list_temp) to use in step 5.
5. For each element in ops_list_temp, find the indices of the largest n values based on the user-specified topn and save the results to indices_list.
6. For each element in indices_list, subset df by the indices in that element and store the corresponding value in values_list.
The full reprex is below, including the different attempts using purrr::map and base lapply. I also attempted to use := from data.table, following the link below, but I was unable to figure out how to transform the list of formulas into formulas that could be fed to quote(`:=`(...)):
Apply a list of formulas to R data.table
It appears to me that one of the bottlenecks in my code is the variable operation step (STEP 4). With m=4 and 90 variables, there are a total of 2,555,190 elements (RcppAlgos::comboCount(v = 90, m = 4, repetition = FALSE)). Breaking this up into chunks of 10,000 to get around memory limitations results in a list of 256 elements.
With m=5, there are 43,949,268 elements (RcppAlgos::comboCount(v = 90, m = 5, repetition = FALSE)) and a chunks list of ~4,395 elements.
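These counts can also be checked with base R's choose(), without RcppAlgos. This is only a minimal sketch; n_vars and chunk_size are names chosen here for illustration, with chunk_size standing for the 10,000-element chunks described above.

```r
n_vars <- 90
chunk_size <- 10000  # illustrative name for the chunk size used in STEP 3

# number of unique combinations for m = 4 and m = 5
n_comb <- choose(n_vars, c(4, 5))

# number of chunks of at most chunk_size combinations each
n_chunks <- ceiling(n_comb / chunk_size)

data.frame(m = c(4, 5), combinations = n_comb, chunks = n_chunks)
```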
A previous bottleneck was in the ordering step, which I've managed to speed up quite a bit using the kit library and the link below, but any suggestions that could speed up the entire flow are appreciated. The example I'm posting here uses combn of 4, as that is typically what I use in my workflow, but I would also like to be able to go up to combn of 5 if the speed is reasonable.
Fastest way to find second (third...) highest/lowest value in vector or column
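For reference, the ordering step asks for the positions of the n largest values, not the values themselves. Below is a minimal base R sketch of that operation on a toy vector; the names x, n and top_idx are illustrative only. The reprex uses kit::topn for the same purpose, presumably because a full sort with order() is wasteful when only a few hundred positions out of 80,000 are needed.

```r
set.seed(1)
x <- rnorm(20)   # toy vector standing in for one element of ops_list_temp
n <- 3           # toy stand-in for the user-specified topn

# positions of the n largest values, largest first
top_idx <- order(x, decreasing = TRUE)[seq_len(n)]

top_idx      # indices into x
x[top_idx]   # the corresponding values, in decreasing order
```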
library(purrr)
library(stringr)
library(kit)
df <- data.frame(matrix(data = rnorm(80000*90,200,500), nrow = 80000, ncol = 90))
df$value <- rnorm(80000,200,500)
cols <- names(df)
cols <- cols[!grepl("value", cols)]
combination <- 4
STEP 1:
## create unique combinations of column names
ops_vec <- combn(cols, combination, FUN = paste, collapse = "*")
STEP 2:
## transform ops vector into list of formulas
ops_vec_l <- purrr::map(ops_vec, .f = function(x) str_split(x, "\\*", simplify = T))
STEP 3:
## break up the list of formulas into chunks otherwise memory error
chunks_run <- split(1:length(ops_vec_l), ceiling(seq_along(ops_vec_l)/10000))
## store results of each chunk into one final list
chunks_list <- vector("list", length = length(chunks_run))
STEP 4:
ptm <- Sys.time()
chunks_idx <- 1
for (chunks_idx in seq_along(chunks_run))
{
STEP 4 (cont):
## using purrr::map
# p <- Sys.time()
ele_length <- length(chunks_run[[chunks_idx]])
ops_list_temp <- vector("list", length = ele_length)
ops_list_temp <- purrr::map(
ops_vec_l[ chunks_run[[chunks_idx]] ], .f = function(x) df[,x[,1]]*df[,x[,2]]*df[,x[,3]]*df[,x[,4]]
)
# (p <- Sys.time()-p) #Time difference of ~ 3.6 secs to complete chunk of 10,000 operations
# ## using base lapply
# p <- Sys.time()
# ele_length <- length( ops_vec_l[ chunks_run[[chunks_idx]] ])
# ops_list_temp <- vector("list", length = ele_length)
# ops_list_temp <- lapply(
# ops_vec_l[ chunks_run[[chunks_idx]] ], function(x) df[,x[,1]]*df[,x[,2]]*df[,x[,3]]*df[,x[,4]]
# )
# (p <- Sys.time()-p) #Time difference of ~3.7 secs to complete a chunk of 10,000 operations
## number of rows I want to subset from df
topn <- 250
## list to store indices of topn values for each list element
indices_list <- vector("list", length = length(ops_list_temp))
## list to store value of the topn indices for each list element
values_list <- vector("list", length = length(ops_list_temp))
STEP 5:
## for each variable combination in "ops_list_temp" list, find the index (indices) of the topn values in decreasing order
## each element in this list should be the length of topn
indices_list <- purrr::map(ops_list_temp, .f = function(x) kit::topn(vec = x, n = topn, decreasing = T, hasna = F))
STEP 6:
## after finding the indices of the topn values for a given variable combination, find the value(s) corresponding to index (indices) and store in the list
## each element in this list, should be the length of topn
values_list <- purrr::map(indices_list, .f = function(x) df[x,"value"])
## save completed chunk to final list
chunks_list[[chunks_idx]] <- values_list
}
(ptm <- Sys.time()-ptm) # Time difference of 41.1 mins
You are paste()ing your values only to split them again in step 2, why?
In the piece of code below, I took your problem and applied these lessons. On my fairly old i5 6267U dual core processor, it took 8 seconds to run the parallelized future_map() on 10,000 combinations. That is equivalent to one iteration of your for loop, which takes 46 seconds on my machine. So this yields an approximate speedup of 6x. Since your loop was not parallelized, you may see an even greater increase if you have a more modern processor with more cores. The preparatory steps before that are also much faster.
library(stringr)
library(kit)
library(furrr)
plan(multisession)
# Parameters
tpn <- 250 # set topn n parameter
combination <- 4
# Data
df <- data.frame(matrix(data = rnorm(80000*90,200,500), nrow = 80000, ncol = 90))
df$value <- rnorm(80000,200,500)
# Calculations
cols <- which(names(df) != "value") # indices for all columns but `value`
cbn <- combn(cols, combination, simplify = F) # combinations
result <- cbn |>
  future_map(\(cb) df[, cb] |>   # select the respective columns
    Reduce(f = `*`) |>           # rowwise product
    kit::topn(tpn) |>
    (\(x) df[x, "value"])()      # select corresponding values
  )
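The Reduce(f = `*`) step relies on a data frame being a list of columns: Reduce multiplies the first selected column by the second, that result by the third, and so on, giving one product per row. A toy sketch of just that step (the data frame toy and the other names are made up for illustration):

```r
toy <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9),
                  value = c(10, 20, 30))
cb <- c(1, 2, 3)  # one combination of column indices, like an element of cbn

# element-wise product of the selected columns, i.e. one product per row
row_prod <- Reduce(f = `*`, toy[, cb])
row_prod

# the same product written out by hand
toy$a * toy$b * toy$c
```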
The native pipe |> and the \(x) lambda shorthand require R 4.1 or later. On R 4.0 you can use this:
library(purrr)
result <- cbn %>%
  future_map(function(cb) df[, cb] %>%
    Reduce(f = `*`) %>%
    kit::topn(tpn, hasna = F) %>%
    `[`(df, ., "value")
  )
If the multithreading still does not work, replace future_map() with map() to run it sequentially. Also, when you are testing, you may want to restrict the data to a subset like cbn[seq_len(1e4)] so that you don't have to wait around for the whole thing to finish.
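One way to act on that advice is to time a small subset and scale the result up to the full number of combinations. The following is a rough sketch with scaled-down toy data, base lapply() instead of future_map(), and order() instead of kit::topn; the sizes and the names toy, n_test and est_total_secs are assumptions for illustration. A linear extrapolation like this ignores parallel overhead and memory effects, so treat the estimate as a ballpark figure.

```r
set.seed(42)
toy <- data.frame(matrix(rnorm(2000 * 10), nrow = 2000, ncol = 10))
toy$value <- rnorm(2000)

cols <- which(names(toy) != "value")
cbn <- combn(cols, 4, simplify = FALSE)   # all 4-column combinations of 10 columns
n_test <- 50                              # size of the test subset

timing <- system.time(
  res <- lapply(cbn[seq_len(n_test)], function(cb) {
    prod_vec <- Reduce(f = `*`, toy[, cb])                  # rowwise product
    idx <- order(prod_vec, decreasing = TRUE)[seq_len(25)]  # top 25 row indices
    toy[idx, "value"]                                       # corresponding values
  })
)

# naive linear estimate of the time needed for all combinations
est_total_secs <- unname(timing["elapsed"]) / n_test * length(cbn)
est_total_secs
```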