Search code examples
rdatefor-looplapplyposixct

Is it possible to improve loop performance or improve completion speed with lapply alternatives


I have a nested for loop that is running through every line in an 800,000+ observation data.frame called alltrx. It checks whether each value in the Posixct formatted column TIME in alltrx is within a date interval created from values in a second data.frame called long. If the condition is true the loop takes the row number from the "correct" line of which the date interval was obtained from long and puts it into a new column alltrx$Survey. As far as I can tell it is working, but I tried with a progress bar and only got through 4% in one hour. I would like to ask if anyone has ideas on how to speed this up? I also have alltrx as a list of lists where the data is divided up among the lists (in a way that makes sense to my project).

I have looked at this and this but it didn't seem to provide much chance of improving the speed the second being not much help at all...

....see my code below


for (i in 1:nrow(alltrx)){ 
  for (j in 1:nrow(long)) {
    if(alltrx$TIME2[i] %within% 
       (interval(ymd(long$V2[j]), ymd(long$V1[j])))){alltrx$Survey[i]<-row.name(long[j])}
  }
}

I am interested in improving either my loop code or other coding alternatives that may have better performance such as lapply thanks for any and all help. I haven't provided a data set as it doesn't seem necessary but can do so if someone points out the importance of having it.


Solution

  • In the end this process was made significantly faster (5 minutes compared to ~24 hours) by processing the large data frame as a series of lists which (I think) allows multiple processes to occur at once. Where the date interval was concerned fuzzyjoin::left_fuzzy_join() was useful in outlining the conditions where the match_fun argument allows you to determine the conditional relationship with each 2 variables selected in the by argument. Hope someone out there finds this interesting or even useful :p

    To complete the example below is the running command from my script. Note: the original alltrx file was divided into a list of data.frames called alltrx.list

    
     fuzz.join <- lapply(alltrx.list, function(x){
          fuzzy_left_join(x, long, by = c("TIME" = "V1", "TIME" = "V2")
        , match_fun = list(`>=`, `<=`))
        })