Search code examples
rdplyrlubridatefuzzyjoin

Match two tables based on a time difference criterium


I have a data table (lv_timest) with time stamps every 3 hours for each date:

# A tibble: 6 × 5
     LV0_mean LV1_mean  LV2_mean Date_time           Date      
     <dbl>    <dbl>     <dbl>    <S3:POSIXct>        <date>    
1    0.778    -4.12     0.736    2016-12-28 00:00:00 2016-12-28
2    0.376    -0.234    0.388    2016-12-28 03:00:00 2016-12-28
3    0.409    1.46      0.241    2016-12-28 06:00:00 2016-12-28
4    0.760    2.07      0.460    2016-12-28 09:00:00 2016-12-28
5    0.759    2.91      0.735    2016-12-28 12:00:00 2016-12-28
6    0.857    3.00      0.803    2016-12-28 15:00:00 2016-12-28

from which I would like to extract the time stamps that match as closely as possible those of another table (event_timest):

# A tibble: 6 × 4
   Event_number Date_time           Date       Date_time_new
   <int>        <S3: POSIXct>       <date>     <S3: POSIXct>
1  75           2016-12-28 08:00:00 2016-12-28 2016-12-28 08:00:00
2  123          2016-12-30 14:02:00 2016-12-30 2016-12-30 14:00:00
3  264          2017-01-07 06:12:00 2017-01-07 2017-01-07 06:00:00
4  317          2017-01-09 10:59:00 2017-01-09 2017-01-09 11:00:00
5  318          2017-01-09 13:31:00 2017-01-09 2017-01-09 14:00:00
6  369          2017-01-11 07:24:00 2017-01-11 2017-01-11 07:00:00

For example, for row 1 in table event_timest, I would extract row 4 from table lv_timest:

Event_number Date_time.x          Date.x      Date_time_new      LV0_mean LV1_mean   LV2_mean Date_time.y          Date.y
<int>        <S3: POSIXct>        <date>      <S3: POSIXct>      <dbl>    <dbl>      <dbl>    <S3: POSIXct>        <date>                         
75           2016-12-28 08:00:00  2016-12-28 2016-12-28 08:00:00 0.760    2.07       0.460    2016-12-28 09:00:00  2016-12-28

In fact, the time difference should not be over one hour. I thought of using the fuzzyjoin package for this, and writing a function that computes the time difference between timestamps of the two table, as hours. However, fuzzy_inner_join replicates rows in the second table and takes several timestamps in the first table to match it.

require(lubridate)
require(fuzzyjoin)

diff_timest <- function(x, y){abs(x%--%y %/% hours(1)) <= 1} # time interval as hours ≤ 1 hour

match_timest <- fuzzy_inner_join(event_timest, lv_timest,
                                 by = c("Date" = "Date",
                                        "Date_time_new" = "Date_time"),
                                 match_fun = list(`==`, diff_timest))
head(match_timest)

# A tibble: 6 × 9
  Event_number Date_time.x         Date.x     Date_time_new       LV0_mean LV1_mean LV2_mean Date_time.y         Date.y    
         <int> <dttm>              <date>     <dttm>                 <dbl>    <dbl>    <dbl> <dttm>              <date>    
1           75 2016-12-28 08:00:00 2016-12-28 2016-12-28 08:00:00   0.760     2.07     0.460 2016-12-28 09:00:00 2016-12-28
2          123 2016-12-30 14:02:00 2016-12-30 2016-12-30 14:00:00   1.24      1.83     2.05  2016-12-30 15:00:00 2016-12-30
3          264 2017-01-07 06:12:00 2017-01-07 2017-01-07 06:00:00  -0.128    -5.43     2.72  2017-01-07 06:00:00 2017-01-07
4          317 2017-01-09 10:59:00 2017-01-09 2017-01-09 11:00:00  -0.0751    0.171    2.56  2017-01-09 09:00:00 2017-01-09
5          317 2017-01-09 10:59:00 2017-01-09 2017-01-09 11:00:00  -0.204    -0.797    2.28  2017-01-09 12:00:00 2017-01-09
6          318 2017-01-09 13:31:00 2017-01-09 2017-01-09 14:00:00  -0.204    -0.797    2.28  2017-01-09 12:00:00 2017-01-09

Would there be another way to do this?


Solution

  • I would suggest a standard join, followed by a grouped filter to the closest instance of each timestamp:

    library(tidyverse)
    library(lubridate)
    
    match_timest <- event_timest %>%
      inner_join(lv_timest, by = "Date") %>%
      mutate(diff = abs(as.numeric(Date_time.x - Date_time.y, unit = "hours"))) %>%
      group_by(Date_time.y) %>%
      filter(diff <= 1 & diff == min(diff)) %>%
      ungroup() %>%
      select(!diff)
    

    Note:

    • this will still match multiple rows if there’s more than one that are the exact same shortest difference from the index timestamp.
    • this won’t match timestamps from different dates — eg, 23:59:59 on 1/1/22 won’t be matched with 00:00:00 on 1/2/22. If you’d like to do that, you can use a full Cartesian join (full_join(lv_timest, by = character())) rather than the inner_join() above.