
Faster way to find the dates in one vector closest to each element of another vector


I have several time vectors of different sizes and one time vector sampled once per second.

I was trying to find the closest point to the $i^{th}$ element, but this method is extremely slow.

    for (i in 1:length(SamplingTime)){
      which.min(abs(SamplingTime[i]-rTime1))
    }

Additionally, I would like to know how to find the two closest data points to the i-th element of SamplingTime. My original approach was to convert the POSIX times to numeric values and use the RANN package:

    closest <- nn2(data=mytimes, k=2)[[1]]

But again, it is too slow.

Edit:

    SampleTime                   rTime
    2018-06-01 00:51:40   UTC    2018-06-01 00:51:37 UTC
    2018-06-01 00:51:41.2 UTC    2018-06-01 00:51:38 UTC
    2018-06-01 00:51:41.4 UTC    2018-06-01 00:51:39 UTC
    2018-06-01 00:51:41.5 UTC    2018-06-01 00:51:40 UTC
    2018-06-01 00:51:41.9 UTC    2018-06-01 00:51:41 UTC
    2018-06-01 00:51:43   UTC    2018-06-01 00:51:42 UTC
    2018-06-01 00:51:46   UTC    2018-06-01 00:51:43 UTC
    2018-06-01 00:51:48   UTC    .
    .                            .
    .

The idea is that, for each i, I have to find the two values of rTime that are closest to SampleTime[i]. For instance, for SampleTime[3] = 2018-06-01 00:51:41.4 UTC, the two closest rTime values would be rTime[5] = 2018-06-01 00:51:41 UTC and rTime[6] = 2018-06-01 00:51:42 UTC.


Solution

  • The post actually contains two questions. The first asks for a faster way to find the closest value in rTime for each value in SampleTime.

    The for loop of the OP "prints" the indices of the nearest value in rTime. (Well, actually the code snippet of the OP returns nothing without a print() statement or storing the values.)

    The code below returns the indices using a rolling join to nearest which is available with the data.table package.

    # reproduce OP's data
    SampleTime <- 
      structure(c(1527814300, 1527814301.2, 1527814301.4, 1527814301.5, 
                  1527814301.9, 1527814303, 1527814306, 1527814308), 
                class = c("POSIXct", "POSIXt"), tzone = "UTC")
    rTime <- 
      structure(c(1527814297, 1527814298, 1527814299, 1527814300, 1527814301, 
                  1527814302, 1527814303), 
                class = c("POSIXct", "POSIXt"), tzone = "UTC")
    
    library(data.table)
    sDT <- data.table(SampleTime)
    rDT <- data.table(rTime)
    # rolling join to nearest
    rDT[sDT, on = .(rTime = SampleTime), roll = "nearest", which = TRUE]
    
    [1] 4 5 5 5 6 7 7 7
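
For intuition, the same indices can be reproduced in base R without data.table. The following is only a minimal sketch, assuming rTime is sorted in increasing order. The helper name nearest_index is hypothetical, and the tie-breaking rule (prefer the earlier value) is a choice made for this sketch.

```r
# OP's sample data, as reproduced above
SampleTime <-
  structure(c(1527814300, 1527814301.2, 1527814301.4, 1527814301.5,
              1527814301.9, 1527814303, 1527814306, 1527814308),
            class = c("POSIXct", "POSIXt"), tzone = "UTC")
rTime <-
  structure(c(1527814297, 1527814298, 1527814299, 1527814300, 1527814301,
              1527814302, 1527814303),
            class = c("POSIXct", "POSIXt"), tzone = "UTC")

# Hypothetical helper: index of the nearest element of `ref` for each `x`.
# Assumption: `ref` is sorted in increasing order.
nearest_index <- function(x, ref) {
  x_num   <- as.numeric(x)
  ref_num <- as.numeric(ref)
  n <- length(ref_num)
  # position of the last ref value <= x (0 if x precedes all ref values)
  lo <- findInterval(x_num, ref_num)
  left  <- pmax(lo, 1L)       # neighbour at or below x, clamped to range
  right <- pmin(lo + 1L, n)   # neighbour above x, clamped to range
  # pick the closer neighbour; ties go to the earlier value
  ifelse(x_num - ref_num[left] <= ref_num[right] - x_num, left, right)
}

idx <- nearest_index(SampleTime, rTime)
idx
```

Because findInterval() uses a binary search on the sorted vector, each lookup avoids scanning all of rTime, which is what makes the which.min() loop slow.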
    

    If the values are required instead of indices:

    sDT[, rTime := rDT[sDT, on = .(rTime = SampleTime), roll = "nearest", x.rTime]][]
    
                SampleTime               rTime
    1: 2018-06-01 00:51:40 2018-06-01 00:51:40
    2: 2018-06-01 00:51:41 2018-06-01 00:51:41
    3: 2018-06-01 00:51:41 2018-06-01 00:51:41
    4: 2018-06-01 00:51:41 2018-06-01 00:51:41
    5: 2018-06-01 00:51:41 2018-06-01 00:51:42
    6: 2018-06-01 00:51:43 2018-06-01 00:51:43
    7: 2018-06-01 00:51:46 2018-06-01 00:51:43
    8: 2018-06-01 00:51:48 2018-06-01 00:51:43
    

    Note that fractional seconds and time zone information are omitted by default when printing POSIXct objects. To show both, a format needs to be specified:

    sDT[, rTime := rDT[sDT, on = .(rTime = SampleTime), roll = "nearest", x.rTime]][
      , lapply(.SD, format, format = "%F %H:%M:%OS1 %Z")]
    
                      SampleTime                     rTime
    1: 2018-06-01 00:51:40.0 UTC 2018-06-01 00:51:40.0 UTC
    2: 2018-06-01 00:51:41.2 UTC 2018-06-01 00:51:41.0 UTC
    3: 2018-06-01 00:51:41.4 UTC 2018-06-01 00:51:41.0 UTC
    4: 2018-06-01 00:51:41.5 UTC 2018-06-01 00:51:41.0 UTC
    5: 2018-06-01 00:51:41.9 UTC 2018-06-01 00:51:42.0 UTC
    6: 2018-06-01 00:51:43.0 UTC 2018-06-01 00:51:43.0 UTC
    7: 2018-06-01 00:51:46.0 UTC 2018-06-01 00:51:43.0 UTC
    8: 2018-06-01 00:51:48.0 UTC 2018-06-01 00:51:43.0 UTC
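
The effect of the format string can also be seen on a single value in base R. This is a small illustration only; the variable x is a hypothetical timestamp built for this example and is not part of the OP's data.

```r
# Hypothetical single timestamp with a fractional second (00:51:41.5 UTC)
x <- as.POSIXct(1527814301.5, origin = "1970-01-01", tz = "UTC")

# Default formatting shows whole seconds only (unless options(digits.secs) is set)
plain <- format(x)

# "%OS1" keeps one decimal digit of the seconds, "%Z" appends the time zone
detailed <- format(x, format = "%F %H:%M:%OS1 %Z")

plain
detailed
```

Keep in mind that %OSn truncates the seconds to n decimal places rather than rounding them, so values that are not exactly representable as doubles may show a last digit one lower than expected.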
    

    Benchmark

    The benchmark compares three different methods:

    • the for loop as used by the OP but modified to return a vector of indices
    • a more concise rewrite using sapply(), and
    • a rolling join to nearest

    All three return a vector of indices.

    The benchmark data consist of 1000 sample times, which is a rather small test case.

    library(data.table)
    library(magrittr)
    # create benchmark data
    n <- 1000L
    set.seed(1L)
    SampleTime <- lubridate::as_datetime("2018-06-01") + cumsum(rnorm(n, 1)) %>% 
      sort()
    
    rTime <- seq(lubridate::floor_date(min(SampleTime), "min"),
                 lubridate::ceiling_date(max(SampleTime), "min"),
                 by = "sec")
    
    # perform benchmark
    microbenchmark::microbenchmark(
      loop = {
        idx <- integer(length(SampleTime))
        for (i in 1:length(SampleTime)){
          idx[i] <- (which.min(abs(SampleTime[i] - rTime)))
        }
        idx
      },
      sapply = {
        sapply(
          seq_along(SampleTime), 
          function(i) which.min(abs(SampleTime[i] - rTime))
        )
      },
      roll_join = {
        sDT <- data.table(SampleTime)
        rDT <- data.table(rTime)
        rDT[sDT, on = .(rTime = SampleTime), roll = "nearest", which = TRUE]
      },
      times = 100L
    )
    

    The rolling join is the fastest method by a factor of almost 40 in median run time (about 1.4 ms versus 52 to 55 ms), even for this rather small benchmark case:

    Unit: milliseconds
          expr       min        lq      mean    median        uq        max neval cld
          loop 51.467338 53.365061 57.174145 54.722276 57.270950 214.442708   100   c
        sapply 49.833166 51.244187 53.600532 52.424695 55.126666  64.886196   100  b 
     roll_join  1.093099  1.355139  1.462512  1.408001  1.496544   5.411494   100 a
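
The second part of the question, finding the two closest values of rTime for each sample time, is not covered by the code above. The following base R code is a minimal sketch, not a benchmarked solution. It assumes that rTime is sorted in increasing order and has at least two elements. The helper name two_nearest is hypothetical. It relies on the observation that, in a sorted vector, the two nearest neighbours of a value always lie among the four elements surrounding its insertion point.

```r
# OP's sample data
SampleTime <-
  structure(c(1527814300, 1527814301.2, 1527814301.4, 1527814301.5,
              1527814301.9, 1527814303, 1527814306, 1527814308),
            class = c("POSIXct", "POSIXt"), tzone = "UTC")
rTime <-
  structure(c(1527814297, 1527814298, 1527814299, 1527814300, 1527814301,
              1527814302, 1527814303),
            class = c("POSIXct", "POSIXt"), tzone = "UTC")

# Hypothetical helper: for each x, the indices of the two nearest ref values.
# Assumptions: `ref` is sorted in increasing order and length(ref) >= 2.
two_nearest <- function(x, ref) {
  x_num   <- as.numeric(x)
  ref_num <- as.numeric(ref)
  n <- length(ref_num)
  # position of the last ref value <= x (0 if x precedes all ref values)
  lo <- findInterval(x_num, ref_num)
  res <- vapply(seq_along(x_num), function(i) {
    # only the four elements around the insertion point can qualify
    cand <- (lo[i] - 1L):(lo[i] + 2L)
    cand <- cand[cand >= 1L & cand <= n]
    # order candidates by distance; ties keep the lower index first
    cand[order(abs(ref_num[cand] - x_num[i]))[1:2]]
  }, integer(2))
  t(res)  # one row per element of x: nearest, second nearest
}

nn <- two_nearest(SampleTime, rTime)
nn
```

The first column holds the index of the nearest value and should agree with the indices from the rolling join on the sample data; the second column holds the runner-up. The values themselves can be obtained with rTime[nn[, 1]] and rTime[nn[, 2]].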