I am looking for an efficient way to find, for every time in a large dataset (a), the closest time in a small dataset (x). The result has to be a vector of indices into (x) with the same length as (a). I have already written a function that works correctly, but it is useless for large data because it takes days to run.
Here is my function:

closest_time <- function(x, a, which = TRUE, na.rm = FALSE) {
  if ("POSIXt" %in% class(x)) x <- as.numeric(x)
  if ("POSIXt" %in% class(a)) a <- as.numeric(a)
  sapply(a, function(y) DescTools::Closest(x, y, which = which, na.rm = na.rm)[1])
}
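For illustration, a call on small made-up vectors looks like this (closest_time is just the name given to the function above; the real x and a are much larger):

# made-up POSIXct vectors, only to show the intended usage
x <- as.POSIXct("2020-01-01", tz = "UTC") + seq(0, by = 30, length.out = 10)  # 30 s data
a <- as.POSIXct("2020-01-01", tz = "UTC") + runif(100, 0, 300)                # irregular high-rate data
idx <- closest_time(x, a)  # one index into x for each element of a, so length(idx) == length(a)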
Both datasets x and a are filtered and therefore have no consistent time step, but they were filtered according to the same criteria. Vector a contains 20 Hz data and has length 16,020,209; x contains 30-second data and has length 26,908.
Any suggestions are very much appreciated! Thank you :)
One can use a rolling join from data.table:
library(data.table)
set.seed(1) # reproducibility on Stack Overflow
DF_A <- data.table(x = seq(-500, by = 0.5, length.out = 26908),
                   idx = seq_len(26908))
DF_HZ <- data.table(x = round(runif(16020209, first(DF_A$x), last(DF_A$x)), 3),
                    idx_hz = seq_len(16020209))
DF_HZ[, x_hz := x + 0] # so we can check
DF_A[, x_a := x + 0] # so we can check
setkey(DF_A, x)
setkey(DF_HZ, x)
# order(idx_hz) returns the result in the same order as DF_HZ;
# it is not necessary for the join itself.
DF_A[DF_HZ, roll = "nearest"][order(idx_hz)]
#> x idx x_a idx_hz x_hz
#> 1: 3072.021 7145 3072.0 1 3072.021
#> 2: 4506.369 10014 4506.5 2 4506.369
#> 3: 7206.883 15415 7207.0 3 7206.883
#> 4: 11718.574 24438 11718.5 4 11718.574
#> 5: 2213.328 5428 2213.5 5 2213.328
#> ---
#> 16020205: 10517.477 22036 10517.5 16020205 10517.477
#> 16020206: 11407.776 23817 11408.0 16020206 11407.776
#> 16020207: 12051.919 25105 12052.0 16020207 12051.919
#> 16020208: 3482.463 7966 3482.5 16020208 3482.463
#> 16020209: 817.366 2636 817.5 16020209 817.366
Created on 2020-11-11 by the reprex package (v0.3.0)
On my machine, the above (not including the creation of the dummy data) takes about 3 s.
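To get exactly what the question asks for, a vector with one index into x per element of a, the same rolling join can be run on the original POSIXct vectors and the matched index column extracted afterwards. A minimal sketch, assuming x and a are the POSIXct vectors from the question (the table and column names DT_x, DT_a, idx_x, idx_a are only illustrative):

library(data.table)
# x: 26,908 reference times (30 s data); a: 16,020,209 query times (20 Hz data)
DT_x <- data.table(t = as.numeric(x), idx_x = seq_along(x))
DT_a <- data.table(t = as.numeric(a), idx_a = seq_along(a))
setkey(DT_x, t)
setkey(DT_a, t)
# for every row of DT_a, take the row of DT_x with the nearest time,
# then restore the original order of a
res <- DT_x[DT_a, roll = "nearest"][order(idx_a)]
idx <- res$idx_x  # length(a) vector: index into x of the closest time

Because the join works on sorted keys, it avoids the element-by-element scan over x that makes the sapply/Closest approach so slow.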