Tags: r, quicksort, large-data, closest

What is an efficient way to find the closest time of a dataset for every entry of a larger reference dataset?


I am searching for an efficient way to find the closest times of a small dataset (x) compared to a large dataset (a). The result has to be a vector of indices with the length of (a). I have already created a function which works very nicely; however, it is useless for large data, as it takes days to process.

Here is my function:

closest_time <- function(x, a, which = TRUE, na.rm = FALSE) {
  if ("POSIXt" %in% class(x)) x <- as.numeric(x)
  if ("POSIXt" %in% class(a)) a <- as.numeric(a)
  # for every element of a, return the index of the closest element of x
  sapply(a, function(y) DescTools::Closest(x, y, which = which, na.rm = na.rm)[1])
}

Both datasets x and a are filtered and therefore have no regular time stamp, but they are filtered according to the same criteria.

Vector a contains 20 Hz data with a length of 16,020,209, and x contains 30 s data with a length of 26,908.
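
As a small-scale illustration of what the function should do (hypothetical timestamps; closest_time is the function defined above and requires the DescTools package):

x <- as.POSIXct("2020-01-01", tz = "UTC") + seq(0, 300, by = 30)    # 30 s data, 11 timestamps
a <- as.POSIXct("2020-01-01", tz = "UTC") + seq(0, 300, by = 0.05)  # 20 Hz data, 6001 timestamps

idx <- closest_time(x, a)
length(idx)  # 6001, i.e. length(a)
head(idx)    # indices into x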

Any suggestions are very much appreciated! Thank you :)


Solution

  • One can use a rolling join from data.table:

    library(data.table)
    set.seed(1)  # reproducibility on Stack Overflow
    DF_A <- data.table(x = seq(-500, by = 0.5, length.out = 26908),
                       idx = seq_len(26908))
    
    DF_HZ <- data.table(x = round(runif(16020209, first(DF_A$x), last(DF_A$x)), 3),
                        idx_hz = seq_len(16020209))
    
    DF_HZ[, x_hz := x + 0] # so we can check
    DF_A[, x_a := x + 0] # so we can check
    
    
    setkey(DF_A, x)
    setkey(DF_HZ, x)
    
    # order(idx_hz) returns the result in the same order as DF_HZ;
    # it is not necessary for the join itself.
    DF_A[DF_HZ, roll = "nearest"][order(idx_hz)]
    #>                   x   idx     x_a   idx_hz      x_hz
    #>        1:  3072.021  7145  3072.0        1  3072.021
    #>        2:  4506.369 10014  4506.5        2  4506.369
    #>        3:  7206.883 15415  7207.0        3  7206.883
    #>        4: 11718.574 24438 11718.5        4 11718.574
    #>        5:  2213.328  5428  2213.5        5  2213.328
    #>       ---                                           
    #> 16020205: 10517.477 22036 10517.5 16020205 10517.477
    #> 16020206: 11407.776 23817 11408.0 16020206 11407.776
    #> 16020207: 12051.919 25105 12052.0 16020207 12051.919
    #> 16020208:  3482.463  7966  3482.5 16020208  3482.463
    #> 16020209:   817.366  2636   817.5 16020209   817.366
    

    Created on 2020-11-11 by the reprex package (v0.3.0)

    On my machine, the above (not including the creation of the dummy data) takes about 3 s.
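
    For the original POSIXct vectors, the same rolling join can be wrapped in a small helper. This is only a sketch under the assumptions of the question (x holds the 30 s timestamps, a the 20 Hz timestamps); the function and column names are placeholders:

    library(data.table)
    
    nearest_x_index <- function(x, a) {
      DT_x <- data.table(time = as.numeric(x), idx_x = seq_along(x))
      DT_a <- data.table(time = as.numeric(a), idx_a = seq_along(a))
      setkey(DT_x, time)
      setkey(DT_a, time)
      # roll every row of DT_a to the nearest time in DT_x, then restore
      # the original order of a and return the matched x-indices
      DT_x[DT_a, roll = "nearest"][order(idx_a), idx_x]
    }
    
    # idx <- nearest_x_index(x, a)  # integer vector of length(a), indexing into x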