Tags: r, optimization, vectorization, nested-loops, rbind

Remove nested for loop with if condition in R


I am trying to optimize my R code by replacing a nested for loop with vectorized code. The nested loop calls rbind inside an if condition. The nested for loop works; however, the vectorized version using rbind doesn't fill the new data frame.

For background, I have two data frames, 'ip' and 'ip_error'. Data frame 'ip' has dimensions 469 x 5, and data frame 'ip_error' has dimensions 9 x 11. After comparing the task start and end columns against the session start and end columns, my output is the selected rows from data frame 'ip'.

This is my working code with the nested for loop:

for(j in 1:length(ip$RUID_KEY)){
 for(i in 1:length(ip_error$RUID_KEY)){
  if(isTRUE(ip_error$RUID_KEY[i]==ip$RUID_KEY[j]&&ip_error$TASK_START[i]>=ip$sess_start[j]&&ip_error$TASK_END[i]<ip$sess_end[j])){
    ev_ip_error<-rbind(ev_ip_error,ip[j,])
  }
}
}

My vectorized code, which does not work, is as follows:

al<-1:length(ip$RUID_KEY)
bl<-1:length(ip_error$RUID_KEY)

f<- function(i,j){
  if(isTRUE(ip_error$RUID_KEY[i]==ip$RUID_KEY[j]&&ip_error$TASK_START[i]>=ip$sess_start[j]&&ip_error$TASK_END[i]<ip$sess_end[j])){
    ev_ip_error<-rbind(ev_ip_error,ip[j,])
  }
}

mapply(f,al,bl)

Here is an example of my data frames. Rows 1 and 3 of 'ip_error' satisfy the if condition.

'ip' data frame

No.  RUID_KEY  sess_start               sess_end
1    101       2018-12-01 22:48:18.827  2018-12-01 22:55:18.900
2    201       2018-12-01 13:10:20.100  2018-12-01 13:50:10.000
3    201       2018-12-12 11:10:10.100  2018-12-12 11:20:00.100

‘ip_error’ data frame

No.  RUID_KEY  TASK_START               TASK_END                 TASK_NAME
1    101       2018-12-01 22:50:18.827  2018-12-01 22:50:18.827  ERROR1
2    101       2018-12-01 15:10:20.100  2018-12-01 15:10:20.100  ERROR2
3    201       2018-12-01 13:40:10.100  2018-12-01 13:40:10.100  ERROR1

The result data frame is initialized as:

ev_ip_error<-data.frame(matrix(ncol=5,nrow=0))
x<-c("RUID_KEY", "sess_start", "sess_end")
colnames(ev_ip_error)<-x

Solution

  • Consider merging the two data frames on RUID_KEY and then subsetting by time:

    ev_ip_error <- subset(merge(ip, ip_error, by="RUID_KEY", suffixes=c("", "_")),
                          TASK_START >= sess_start & TASK_END < sess_end)[names(ip)]
    
    ev_ip_error
    
    #   No. RUID_KEY          sess_start            sess_end
    # 1   1      101 2018-12-01 22:48:18 2018-12-01 22:55:18
    # 3   2      201 2018-12-01 13:10:20 2018-12-01 13:50:10
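
    As a self-contained check of the merge-and-subset idea, here is a minimal sketch that rebuilds the sample tables from the question. It assumes the timestamps are stored as POSIXct (the question does not say how they are stored), leaves out the "No." column, and uses a hypothetical helper `ts` plus an intermediate `merged` object that are not part of the original code.

    ```r
    # Hypothetical helper: parse timestamps with fractional seconds as POSIXct (UTC assumed)
    ts <- function(x) as.POSIXct(x, format = "%Y-%m-%d %H:%M:%OS", tz = "UTC")

    # Sample data modeled on the question's tables
    ip <- data.frame(
      RUID_KEY   = c(101, 201, 201),
      sess_start = ts(c("2018-12-01 22:48:18.827", "2018-12-01 13:10:20.100",
                        "2018-12-12 11:10:10.100")),
      sess_end   = ts(c("2018-12-01 22:55:18.900", "2018-12-01 13:50:10.000",
                        "2018-12-12 11:20:00.100"))
    )

    ip_error <- data.frame(
      RUID_KEY   = c(101, 101, 201),
      TASK_START = ts(c("2018-12-01 22:50:18.827", "2018-12-01 15:10:20.100",
                        "2018-12-01 13:40:10.100")),
      TASK_END   = ts(c("2018-12-01 22:50:18.827", "2018-12-01 15:10:20.100",
                        "2018-12-01 13:40:10.100")),
      TASK_NAME  = c("ERROR1", "ERROR2", "ERROR1")
    )

    # Pair every session with every error that shares its RUID_KEY
    merged <- merge(ip, ip_error, by = "RUID_KEY")

    # Keep only the pairs where the task falls inside the session window
    keep <- merged$TASK_START >= merged$sess_start & merged$TASK_END < merged$sess_end

    # Return just the columns that came from ip
    ev_ip_error <- merged[keep, names(ip)]
    ev_ip_error
    ```

    With this sample data, two sessions are kept: the one for RUID_KEY 101 and the first one for RUID_KEY 201. A session matched by several errors would appear once per matching error, just as in the for loop.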
    

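    This merge is equivalent to the original for loop and to a corrected mapply (or Map) approach. Two things need fixing in the mapply attempt. First, mapply pairs its arguments element by element, so it never visits every combination of rows; expand.grid builds all combinations of row indices between the two data frames. Second, an assignment inside a function only modifies a local copy, so ev_ip_error in the calling environment is never updated. Instead, have the function return the matching row, collect the results in a list, and call rbind once outside the loop. Calling rbind once is also generally more efficient than growing a data frame inside a for loop. See below: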

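    # All index pairs: al indexes rows of ip, bl indexes rows of ip_error.
    # bl is listed first so it varies fastest, matching the order of the for loop.
    prms <- expand.grid(bl = seq_along(ip_error$RUID_KEY),
                        al = seq_along(ip$RUID_KEY))
    
    # i indexes ip_error, j indexes ip; returns the matching ip row, otherwise NULL
    f <- function(i,j){
      if(isTRUE(ip_error$RUID_KEY[i]==ip$RUID_KEY[j] && ip_error$TASK_START[i]>=ip$sess_start[j] && ip_error$TASK_END[i]<ip$sess_end[j])){
         return(ip[j,])
      }
    }
    
    # Pass the indices by name so that i gets ip_error indices and j gets ip indices
    df_list <- mapply(f, i = prms$bl, j = prms$al, SIMPLIFY = FALSE)
    #df_list <- Map(f, i = prms$bl, j = prms$al)   # EQUIVALENT
    
    # NULL entries (non-matching pairs) are ignored by rbind
    ev_ip_error <- do.call(rbind, df_list)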
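
    To illustrate why the mapply attempt in the question leaves the result empty, here is a small sketch using toy values rather than the question's data. The names `acc`, `g`, `res`, `grid`, and `res_all` are hypothetical and chosen only for this illustration.

    ```r
    acc <- c()                     # accumulator in the calling environment

    g <- function(i, j) {
      acc <- c(acc, i * 10 + j)    # assigns to a local copy; the outer acc is untouched
      i * 10 + j                   # the value only survives as the return value
    }

    # mapply pairs arguments element by element: g(1, 1), g(2, 2), g(3, 3)
    res <- mapply(g, 1:3, 1:3)

    length(acc)   # 0, the outer accumulator was never modified
    res           # 11 22 33, three calls rather than nine

    # expand.grid supplies every (i, j) combination, giving nine calls
    grid <- expand.grid(j = 1:3, i = 1:3)
    res_all <- mapply(g, i = grid$i, j = grid$j)
    length(res_all)   # 9
    ```

    The same two effects apply to the original code: the function's rbind result is discarded when the function exits, and only positionally paired rows are ever compared.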
    

    See a comparison of all three approaches in the Online Demo.