I am trying to optimize my R code by replacing a nested for loop with a vectorized approach. The nested loop calls rbind inside an if condition. The nested for loop works; however, the vectorized version using rbind doesn't fill the new data frame.
For background, I have two data frames, 'ip' and 'ip_error'. 'ip' has dimensions 469 x 5 and 'ip_error' has dimensions 9 x 11. I compare the task start and end columns of 'ip_error' against the session start and end columns of 'ip', and the output is the matching rows from 'ip'.
This is my working code with nested for loop
for(j in 1:length(ip$RUID_KEY)){
  for(i in 1:length(ip_error$RUID_KEY)){
    if(isTRUE(ip_error$RUID_KEY[i]==ip$RUID_KEY[j]&&ip_error$TASK_START[i]>=ip$sess_start[j]&&ip_error$TASK_END[i]<ip$sess_end[j])){
      ev_ip_error<-rbind(ev_ip_error,ip[j,])
    }
  }
}
My vectorized attempt, which does not work, is as follows:
al<-1:length(ip$RUID_KEY)
bl<-1:length(ip_error$RUID_KEY)
f<- function(i,j){
  if(isTRUE(ip_error$RUID_KEY[i]==ip$RUID_KEY[j]&&ip_error$TASK_START[i]>=ip$sess_start[j]&&ip_error$TASK_END[i]<ip$sess_end[j])){
    ev_ip_error<-rbind(ev_ip_error,ip[j,])
  }
}
mapply(f,al,bl)
Here is an example of my data frames, where rows 1 and 3 of 'ip_error' satisfy the if condition:
‘ip’ data frame
No. RUID_KEY sess_start sess_end
1 101 2018-12-01 22:48:18.827 2018-12-01 22:55:18.900
2 201 2018-12-01 13:10:20.100 2018-12-01 13:50:10.000
3 201 2018-12-12 11:10:10.100 2018-12-12 11:20:00.100
‘ip_error’ data frame
No. RUID_KEY TASK_START TASK_END TASK_NAME
1 101 2018-12-01 22:50:18.827 2018-12-01 22:50:18.827 ERROR1
2 101 2018-12-01 15:10:20.100 2018-12-01 15:10:20.100 ERROR2
3 201 2018-12-01 13:40:10.100 2018-12-01 13:40:10.100 ERROR1
ev_ip_error<-data.frame(matrix(ncol=5,nrow=0))
x<-c("RUID_KEY", "sess_start", "sess_end")
colnames(ev_ip_error)<-x
Consider a merge of the two data frames followed by a subset by time:
ev_ip_error <- subset(merge(ip, ip_error, by="RUID_KEY", suffixes=c("", "_")),
                      TASK_START >= sess_start & TASK_END < sess_end)[names(ip)]
ev_ip_error
# No. RUID_KEY sess_start sess_end
# 1 1 101 2018-12-01 22:48:18 2018-12-01 22:55:18
# 3 2 201 2018-12-01 13:10:20 2018-12-01 13:50:10
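For anyone who wants to try the merge approach without the original data, here is a minimal sketch that rebuilds the two sample tables from the question. The helper parse_ts, the column name No (instead of No.), and the UTC time zone are assumptions made only for this illustration.

```r
# Hypothetical helper: parse timestamps with fractional seconds (UTC is assumed)
parse_ts <- function(x) as.POSIXct(x, format = "%Y-%m-%d %H:%M:%OS", tz = "UTC")

# Sample rows copied from the question
ip <- data.frame(
  No         = 1:3,
  RUID_KEY   = c(101, 201, 201),
  sess_start = parse_ts(c("2018-12-01 22:48:18.827", "2018-12-01 13:10:20.100",
                          "2018-12-12 11:10:10.100")),
  sess_end   = parse_ts(c("2018-12-01 22:55:18.900", "2018-12-01 13:50:10.000",
                          "2018-12-12 11:20:00.100"))
)

task_times <- parse_ts(c("2018-12-01 22:50:18.827", "2018-12-01 15:10:20.100",
                         "2018-12-01 13:40:10.100"))
ip_error <- data.frame(
  No         = 1:3,
  RUID_KEY   = c(101, 101, 201),
  TASK_START = task_times,
  TASK_END   = task_times,
  TASK_NAME  = c("ERROR1", "ERROR2", "ERROR1")
)

# Join on RUID_KEY, keep pairs where the task falls inside the session,
# then keep only the columns that belong to ip
merged <- merge(ip, ip_error, by = "RUID_KEY", suffixes = c("", "_"))
ev_ip_error <- subset(merged,
                      TASK_START >= sess_start & TASK_END < sess_end)[names(ip)]
print(ev_ip_error)
```

On this sample the join produces four candidate pairs, and the time filter keeps the two sessions that contain an error task.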
The merge approach is equivalent to the original for loop and to a corrected mapply (or Map) approach, which uses expand.grid to build every combination of row indices from the two data frames and returns a list of data frames. Because an assignment made inside a function called by the apply family only changes a local copy, you need to return the matching rows and call rbind once, outside the loop. Calling rbind once is also more efficient than growing a data frame inside a for loop. See below:
# All combinations of row indices: al for ip, bl for ip_error
prms <- expand.grid(al = 1:length(ip$RUID_KEY),
                    bl = 1:length(ip_error$RUID_KEY))
# i indexes ip_error, j indexes ip
f <- function(i,j){
  if(isTRUE(ip_error$RUID_KEY[i]==ip$RUID_KEY[j] && ip_error$TASK_START[i]>=ip$sess_start[j] && ip_error$TASK_END[i]<ip$sess_end[j])){
    return(ip[j,])
  }
}
# Pass bl as i and al as j so each index is applied to the matching data frame
df_list <- mapply(f, i = prms$bl, j = prms$al, SIMPLIFY = FALSE)
#df_list <- Map(f, i = prms$bl, j = prms$al) # EQUIVALENT
ev_ip_error <- do.call(rbind, df_list)
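To see why the original mapply call left the data frame empty, here is a small sketch of the scoping behaviour on a toy data frame. The names df, out, grow, pick and res are hypothetical and used only for this illustration.

```r
df  <- data.frame(id = 1:3, val = c("a", "b", "c"))
out <- df[0, ]                       # empty accumulator in the global environment

# Assigning with <- inside a function creates a local 'out';
# the global 'out' is never modified
grow <- function(k) {
  out <- rbind(out, df[k, ])
}
invisible(lapply(seq_len(nrow(df)), grow))
nrow(out)                            # still 0

# Returning the row (or NULL when the condition fails) and binding once works
pick <- function(k) if (df$id[k] != 2) df[k, ]
res  <- do.call(rbind, lapply(seq_len(nrow(df)), pick))
nrow(res)
```

The NULL elements returned for non-matching rows are dropped when rbind combines the list, so only the selected rows remain.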
See comparison of all three approaches in Online Demo.