r, loops, for-loop, subset, assign

R rewriting a for loop


I've got a loop in my code that I would like to rewrite so that running the code takes a little less time to complete. I know you should avoid loops in R where you can, but I can't think of another way to accomplish my goal.

So I've got a dataset "df_1531" containing a lot of data that I need to cut into pieces using subset() (if anyone knows a better way, let me know ;) ). I've got a vector with 21 variable names, and to each of them I'd like to assign a subset of df_1531. Furthermore, the script contains 22 variables with constraints (shift_XY_time).

So, this is my code now...

# list containing different slots
shift_time_list<- c(startdate, shift_1m_time, shift_1a_time, shift_1n_time,
                               shift_2m_time, shift_2a_time, shift_2n_time,
                               shift_3m_time, shift_3a_time, shift_3n_time,
                               shift_4m_time, shift_4a_time, shift_4n_time, 
                               shift_5m_time, shift_5a_time, shift_5n_time,
                               shift_6m_time, shift_6a_time, shift_6n_time,
                               shift_7m_time, shift_7a_time, shift_7n_time)
# List with subset names 
shift_sub_list <- c("shift_1m_sub", "shift_1a_sub", "shift_1n_sub",
                    "shift_2m_sub", "shift_2a_sub", "shift_2n_sub",
                    "shift_3m_sub", "shift_3a_sub", "shift_3n_sub",
                    "shift_4m_sub", "shift_4a_sub", "shift_4n_sub", 
                    "shift_5m_sub", "shift_5a_sub", "shift_5n_sub",
                    "shift_6m_sub", "shift_6a_sub", "shift_6n_sub",
                    "shift_7m_sub", "shift_7a_sub", "shift_7n_sub")

# The actual loop that I'd like to rewrite
for (i in 1:21) {
  assign(shift_sub_list[i], subset(df_1531, df_1531$'PLS FFM' >= shift_time_list[i] & df_1531$'PLS FFM' < shift_time_list[i+1]))
}

Running the loop takes approximately 6 or 7 seconds. So, if anyone knows a better, cleaner or quicker way to write my code, I'd really like to hear your suggestion or opinion.

**Reproducible example**

mydata <- cars

dput(cars)
structure(list(speed = c(4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 
    12, 12, 12, 12, 13, 13, 13, 13, 14, 14, 14, 14, 15, 15, 15, 16, 
    16, 17, 17, 17, 18, 18, 18, 18, 19, 19, 19, 20, 20, 20, 20, 20, 
    22, 23, 24, 24, 24, 24, 25), dist = c(2, 10, 4, 22, 16, 10, 18, 
    26, 34, 17, 28, 14, 20, 24, 28, 26, 34, 34, 46, 26, 36, 60, 80, 
    20, 26, 54, 32, 40, 32, 40, 50, 42, 56, 76, 84, 36, 46, 68, 32, 
    48, 52, 56, 64, 66, 54, 70, 92, 93, 120, 85)), class = "data.frame", row.names = c(NA, 
-50L))

dist_interval_list <- c(  0,   5,  10,  15,
                         20,  25,  30,  35, 
                         40,  45,  50,  55, 
                         60,  65,  70,  75,
                         80,  85,  90,  95,
                        100, 105, 110, 115, 120)


var_name_list <- c("var_name_1a", "var_name_1b", "var_name_1c", "var_name_1d",
                    "var_name_2a", "var_name_2b", "var_name_2c", "var_name_2d",
                    "var_name_3a", "var_name_3b", "var_name_3c", "var_name_3d",
                    "var_name_4a", "var_name_4b", "var_name_4c", "var_name_4d",
                    "var_name_5a", "var_name_5b", "var_name_5c", "var_name_5d",
                    "var_name_6a", "var_name_6b", "var_name_6c", "var_name_6d")


for (i in 1:24){
  assign(var_name_list[i], subset(mydata,
                                       mydata$dist >= dist_interval_list[i] & 
                                       mydata$dist < dist_interval_list[i+1]))
}


Solution

  • Starting from the reproducible example, and given that the final aim is to summarise another column, you can exploit the fact that the intervals are non-overlapping and simply use the cut function.

    library(tidyverse)
    
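    mydata %>% 
      # right = FALSE gives [a, b) intervals, matching the >= / < tests in the loop
      # (the default, right = TRUE, would give (a, b] instead)
      mutate(interval = cut(dist, breaks = dist_interval_list, right = FALSE)) %>% 
      group_by(interval) %>% 
      summarise(sum = sum(speed))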
    

    This should be much faster, and it also keeps you from getting lost in a messy environment full of variables (which are actually part of your data). You want to keep all your data in a single data frame for as long as possible ;) If your function does not work with data frames, you will probably want to follow up with something like purrrlyr::invoke_rows at the final modeling step.
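If you do need the individual pieces rather than a summary, here is a minimal base R sketch that builds one named list of subsets in place of the assign() loop. It is not part of the answer above, only one possible alternative. It assumes the break points and names from the reproducible example (rebuilt here with seq() and paste0() so the block runs on its own), and the object names idx and subsets are made up for illustration.

```r
# Sketch only: one named list of subsets instead of 24 separate variables.
mydata <- cars

# Same 25 break points and 24 names as in the question's example
dist_interval_list <- seq(0, 120, by = 5)
var_name_list <- paste0("var_name_", rep(1:6, each = 4), letters[1:4])

# findInterval() gives, for each dist, the index i with
# breaks[i] <= dist < breaks[i + 1], i.e. the same [a, b) rule as the loop
idx <- findInterval(mydata$dist, dist_interval_list)

# The loop silently drops rows outside [0, 120); set them to NA so that
# split() drops them as well
idx[idx < 1 | idx >= length(dist_interval_list)] <- NA

# A factor with all 24 levels keeps empty intervals as zero-row data frames
subsets <- split(
  mydata,
  factor(idx, levels = seq_along(var_name_list), labels = var_name_list)
)
```

Individual pieces are then available as subsets$var_name_3b or subsets[["var_name_3b"]], and something like lapply(subsets, function(d) sum(d$speed)) works across all of them. If separate variables really are required, list2env(subsets, envir = globalenv()) creates them in one call, though keeping the list is usually easier to work with.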