Tags: r, for-loop, optimization, time-series, vectorization

Suboptimal use of nested for loops in R. Options for vectorization/optimization?


I have a dataset that stores instances for participants vertically (in long format) over time. Participants can have basically any number of follow-ups; currently there are anywhere from 1 to 14 rows per participant, and more are expected to be added over time.

I have a list of variables var that participants have presumably reported at each follow-up, and I want to create a new set of "ever" variables vare that indicate whether, at or before a given follow-up, the participant has ever reported "yes" for the corresponding variable.

Here is an example of the desired input/output:

var  = c("var1","var2")
vare = paste0(var,"_ever")

data = data.frame(idno         = c(123,123,123,123,123,123,123),
                  followup_num = c(0,1,2,3,4,5,6),
                  var1         = c(0,NA,0,1,0,NA,1),
                  var2         = c(1,NA,NA,0,0,0,1)
                 )
data$var1_ever = c(0,0,0,1,1,1,1)
data$var2_ever = c(1,1,1,1,1,1,1)
  idno followup_num var1 var1_ever var2 var2_ever
   123            0    0         0    1         1
   123            1   NA         0   NA         1
   123            2    0         0   NA         1
   123            3    1         1    0         1
   123            4    0         1    0         1
   123            5   NA         1    0         1
   123            6    1         1    1         1

This is the code I am currently using. Obviously, nested for loops are not ideal in R, and this segment of code becomes particularly slow when handed a few thousand rows.

#For each ID
for (i in unique(data$idno)) {

  id  = data$idno%in%i              #Get the relevant lines for this ID
  fus = sort(data$followup_num[id]) #Get the follow-up numbers
  
  #For each variable in the list
  for (v in seq_along(var)) {

    #Loop through the follow-ups. If the variable reports "yes", mark
    #  this and every subsequent follow-up as having reported that variable ever.
    #  Otherwise, mark 0 at that line and move on to the next follow-up
    for (f in fus) {
      if (t(data[id & data$followup_num%in%f,var[v]])%in%1) {
        data[id & data$followup_num >= f,vare[v]] = 1
        break
      } else {
        data[id & data$followup_num%in%f,vare[v]] = 0
      }
    }    
  }
}

Is this a problem with an existing solution? Is there a way to optimize/simplify? Is there a use of apply/sapply/etc. functions that I neglected to try?


Solution

  • At its core, the solution is the base function cummax(). We need to take NA into account, so I added tidyr::replace_na(), and we need to account for additional idno values by using dplyr::group_by().

    A minimal vectorized solution for a single column is

    df$var1_test <- cummax(tidyr::replace_na(df$var1, 0))
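
    For the example var1 column above, replacing NA with 0 and then taking the
    running maximum reproduces the desired "ever" indicator:

    cummax(tidyr::replace_na(c(0, NA, 0, 1, 0, NA, 1), 0))
    #> [1] 0 0 0 1 1 1 1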
    

    This is a great problem to solve with the tidyverse's mutate() and across() functions!

    library(dplyr)

    df = data.frame(idno         = c(123,123,123,123,123,123,123),
                    followup_num = c(0,1,2,3,4,5,6),
                    var1         = c(0,NA,0,1,0,NA,1),
                    var2         = c(1,NA,NA,0,0,0,1))
    
    df %>% group_by(idno) %>%  
           arrange(idno, followup_num) %>% 
           mutate(across(.cols=starts_with("var"), 
                         .fns= ~ cummax(tidyr::replace_na(.x, 0)), 
                         .names="{.col}_ever2"))
    
       idno followup_num  var1  var2 var1_ever2 var2_ever2
    1   123            0     0     1          0          1
    2   123            1    NA    NA          0          1
    3   123            2     0    NA          0          1
    4   123            3     1     0          1          1
    5   123            4     0     0          1          1
    6   123            5    NA     0          1          1
    7   123            6     1     1          1          1
    

    Alternatively, if you want to collapse the data to a single row per idno, then a grouped max works

    df %>%
      group_by(idno) %>%
      summarise(across(.cols=starts_with("var"), 
                       .fns= ~ max(.x, na.rm=TRUE), 
                       .names="{.col}_ever3"))
    
       idno var1_ever3 var2_ever3
    1   123          1          1
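
    To stay in base R (re: the question about apply-family alternatives), the same
    cummax idea can be applied within each idno via ave(). This is only a sketch,
    assuming the data are sorted by idno and followup_num; it reuses the var and
    vare vectors defined in the question:

    df <- df[order(df$idno, df$followup_num), ]
    for (v in seq_along(var)) {
      x <- ifelse(is.na(df[[var[v]]]), 0, df[[var[v]]])  # treat NA as "not reported yes"
      df[[vare[v]]] <- ave(x, df$idno, FUN = cummax)     # running max within each idno
    }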
    

    P.S. data is already the name of a function in R, so it's better to call the variable df.