Tags: r, quicksort, large-data, closest

What is an efficient way to find the closest time of a dataset for every entry of a larger reference dataset?


I am searching for an efficient way to find the closest times of a small dataset (x) compared to a large dataset (a). The result has to be a vector of indices with the length of (a). I have already created a function which works very nicely; however, it is useless for large data, as it takes days to process.

Here is my function:

closest_time <- function(x, a, which = TRUE, na.rm = FALSE) {
  if ("POSIXt" %in% class(x)) x <- as.numeric(x)
  if ("POSIXt" %in% class(a)) a <- as.numeric(a)
  # for every element of a, return the index of the closest element of x
  sapply(a, function(y) DescTools::Closest(x, y, which = which, na.rm = na.rm)[1])
}

Both datasets x and a are filtered and therefore have no regular time stamp, but they are filtered according to the same criteria.

Vector a contains 20 Hz data with a length of 16,020,209, and x contains 30 s data with a length of 26,908.
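
As a small-scale illustration of what the function should do (hypothetical timestamps; closest_time is the function defined above and requires the DescTools package):

x <- as.POSIXct("2020-01-01", tz = "UTC") + seq(0, 300, by = 30)    # 30 s data, 11 timestamps
a <- as.POSIXct("2020-01-01", tz = "UTC") + seq(0, 300, by = 0.05)  # 20 Hz data, 6001 timestamps

idx <- closest_time(x, a)
length(idx)  # 6001, i.e. length(a)
head(idx)    # indices into x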

Any suggestions are very much appreciated! Thank you :)


Solution

  • One can use a rolling join from data.table:

    library(data.table)
    set.seed(1)  # reproducibility on Stack Overflow
    DF_A <- data.table(x = seq(-500, by = 0.5, length.out = 26908),
                       idx = seq_len(26908))
    
    DF_HZ <- data.table(x = round(runif(16020209, first(DF_A$x), last(DF_A$x)), 3),
                        idx_hz = seq_len(16020209))
    
    DF_HZ[, x_hz := x + 0] # so we can check
    DF_A[, x_a := x + 0] # so we can check
    
    
    setkey(DF_A, x)
    setkey(DF_HZ, x)
    
    # order(idx_hz) returns the result in the same order as DF_HZ;
    # it is not necessary for the join itself.
    DF_A[DF_HZ, roll = "nearest"][order(idx_hz)]
    #>                   x   idx     x_a   idx_hz      x_hz
    #>        1:  3072.021  7145  3072.0        1  3072.021
    #>        2:  4506.369 10014  4506.5        2  4506.369
    #>        3:  7206.883 15415  7207.0        3  7206.883
    #>        4: 11718.574 24438 11718.5        4 11718.574
    #>        5:  2213.328  5428  2213.5        5  2213.328
    #>       ---                                           
    #> 16020205: 10517.477 22036 10517.5 16020205 10517.477
    #> 16020206: 11407.776 23817 11408.0 16020206 11407.776
    #> 16020207: 12051.919 25105 12052.0 16020207 12051.919
    #> 16020208:  3482.463  7966  3482.5 16020208  3482.463
    #> 16020209:   817.366  2636   817.5 16020209   817.366
    

    Created on 2020-11-11 by the reprex package (v0.3.0)

    On my machine, the above (not including the creation of the dummy data) takes about 3 s.
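
    For the original POSIXct vectors, the same rolling join can be wrapped in a small helper. This is only a sketch under the assumptions of the question (x holds the 30 s timestamps, a the 20 Hz timestamps); the function and column names are placeholders:

    library(data.table)
    
    nearest_x_index <- function(x, a) {
      DT_x <- data.table(time = as.numeric(x), idx_x = seq_along(x))
      DT_a <- data.table(time = as.numeric(a), idx_a = seq_along(a))
      setkey(DT_x, time)
      setkey(DT_a, time)
      # roll every row of DT_a to the nearest time in DT_x, then restore
      # the original order of a and return the matched x-indices
      DT_x[DT_a, roll = "nearest"][order(idx_a), idx_x]
    }
    
    # idx <- nearest_x_index(x, a)  # integer vector of length(a), indexing into x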