I have a nested for
loop that is running through every line in an 800,000+ observation data.frame
called alltrx
. It checks whether each value in the Posixct
formatted column TIME
in alltrx
is within a date interval created from values in a second data.frame
called long
. If the condition is true the loop takes the row number from the "correct" line of which the date interval was obtained from long
and puts it into a new column alltrx$Survey
. As far as I can tell it is working, but I tried with a progress bar and only got through 4% in one hour. I would like to ask if anyone has ideas on how to speed this up? I also have alltrx
as a list of lists where the data is divided up among the lists (in a way that makes sense to my project).
I have looked at this and this but it didn't seem to provide much chance of improving the speed the second being not much help at all...
....see my code below
for (i in 1:nrow(alltrx)){
for (j in 1:nrow(long)) {
if(alltrx$TIME2[i] %within%
(interval(ymd(long$V2[j]), ymd(long$V1[j])))){alltrx$Survey[i]<-row.name(long[j])}
}
}
I am interested in improving either my loop code or other coding alternatives that may have better performance such as lapply
thanks for any and all help. I haven't provided a data set as it doesn't seem necessary but can do so if someone points out the importance of having it.
In the end this process was made significantly faster (5 minutes compared to ~24 hours) by processing the large data frame as a series of lists which (I think) allows multiple processes to occur at once. Where the date interval was concerned fuzzyjoin::left_fuzzy_join()
was useful in outlining the conditions where the match_fun
argument allows you to determine the conditional relationship with each 2 variables selected in the by
argument. Hope someone out there finds this interesting or even useful :p
To complete the example below is the running command from my script. Note: the original alltrx
file was divided into a list of data.frame
s called alltrx.list
fuzz.join <- lapply(alltrx.list, function(x){
fuzzy_left_join(x, long, by = c("TIME" = "V1", "TIME" = "V2")
, match_fun = list(`>=`, `<=`))
})