Tags: r, join, merge, data-manipulation, tibble

How to join tibbles/data frames with different numbers of rows by ID and a date/time interval?


I have two datasets, exemplified below:

library(lubridate)
library(tidyverse)

#dataset 1

id <- c("A_1", "A_1", "A_1", "A_1", "A_1", "A_2", "A_2", "A_2", "A_2", 
        "A_2", "B_1", "B_1", "B_1", "B_1", "B_1", "B_2", "B_2", "B_2", "B_2", 
        "B_2")
date <- ymd_hms(c("2017-11-26 09:00:00", "2017-11-26 09:05:00", "2017-11-30 09:00:00", 
                  "2017-11-30 09:05:00", "2017-12-02 09:00:00", "2017-11-26 09:00:00", "2017-11-26 09:05:00", "2017-11-30 09:00:00", 
                  "2017-11-30 09:05:00", "2017-12-02 09:00:00", "2017-11-26 09:00:00", "2017-11-26 09:05:00", "2017-11-30 09:00:00", 
                  "2017-11-30 09:05:00", "2017-12-02 09:00:00", "2017-11-26 09:00:00", "2017-11-26 09:05:00", "2017-11-30 09:00:00", 
                  "2017-11-30 09:05:00", "2017-12-02 09:00:00"))    

df <- tibble(id, date)

# A tibble: 20 x 2
   id    date               
   <chr> <dttm>             
 1 A_1   2017-11-26 09:00:00
 2 A_1   2017-11-26 09:05:00
 3 A_1   2017-11-30 09:00:00
 4 A_1   2017-11-30 09:05:00
 5 A_1   2017-12-02 09:00:00
 6 A_2   2017-11-26 09:00:00
 7 A_2   2017-11-26 09:05:00
 8 A_2   2017-11-30 09:00:00
 9 A_2   2017-11-30 09:05:00
10 A_2   2017-12-02 09:00:00
11 B_1   2017-11-26 09:00:00
12 B_1   2017-11-26 09:05:00
13 B_1   2017-11-30 09:00:00
14 B_1   2017-11-30 09:05:00
15 B_1   2017-12-02 09:00:00
16 B_2   2017-11-26 09:00:00
17 B_2   2017-11-26 09:05:00
18 B_2   2017-11-30 09:00:00
19 B_2   2017-11-30 09:05:00
20 B_2   2017-12-02 09:00:00

#dataset 2

id <- c("A", "A", "B", "B")
date <- ymd_hms(c("2017-11-26 09:01:30", "2017-11-30 09:06:40", "2017-11-30 09:04:50", "2017-12-02 09:01:00"))
variable1 <- c("67", "30", "28", "90")
variable2 <- c("x","y","z", "w")
df2 <- tibble(id, date, variable1, variable2)

# A tibble: 4 x 4
  id    date                variable1 variable2
  <chr> <dttm>              <chr>     <chr>    
1 A     2017-11-26 09:01:30 67        x        
2 A     2017-11-30 09:06:40 30        y        
3 B     2017-11-30 09:04:50 28        z        
4 B     2017-12-02 09:01:00 90        w        

I first need to group by "id", then by date and time, and then pull the columns of dataset 2 into dataset 1 as new columns, matching each row of dataset 2 to the nearest time in dataset 1 (condition: each row connects to the previous time, at most 5 min earlier).

However, each "id" in dataset 2 occurs 50 times in dataset 1, so a row in dataset 2 will probably find a corresponding time 50 times in dataset 1 for the same date. For each "id", I need this extraction to be done as many times as there is a corresponding time, even if that is often.
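One reading of this rule, consistent with the expected output shown further down, is that a dataset 2 row belongs to a dataset 1 row when the id prefix matches and the dataset 2 time falls within the 5 minutes starting at the dataset 1 time. The sketch below only illustrates that reading; the helper names `id_prefix` and `in_window` are hypothetical and not part of any package.

```r
library(lubridate)

# Hypothetical helper: drop the "_<number>" suffix so "A_1" maps to "A"
id_prefix <- function(id) sub("_\\d+$", "", id)

# Hypothetical helper: TRUE when a dataset 2 timestamp falls inside the
# 5-minute window that starts at a dataset 1 timestamp
in_window <- function(date1, date2, width = minutes(5)) {
  date2 >= date1 & date2 < date1 + width
}

id_prefix(c("A_1", "B_2"))

# 09:01:30 belongs to the 09:00:00 row, not to the 09:05:00 row
in_window(ymd_hms("2017-11-26 09:00:00"), ymd_hms("2017-11-26 09:01:30"))
in_window(ymd_hms("2017-11-26 09:05:00"), ymd_hms("2017-11-26 09:01:30"))
```

Because the rule depends only on the id prefix and the time window, every dataset 1 id that shares a prefix ("A_1", "A_2", and so on) picks up the same dataset 2 row.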

The resulting dataset would look like this:

df_output
# A tibble: 20 x 5
   id    date                date2               variable1 variable2
   <chr> <dttm>              <chr>               <chr>     <chr>    
 1 A_1   2017-11-26 09:00:00 2017-11-26 09:01:30 67        x        
 2 A_1   2017-11-26 09:05:00 NA                  NA        NA       
 3 A_1   2017-11-30 09:00:00 NA                  NA        NA       
 4 A_1   2017-11-30 09:05:00 2017-11-30 09:06:40 30        y        
 5 A_1   2017-12-02 09:00:00 NA                  NA        NA       
 6 A_2   2017-11-26 09:00:00 2017-11-26 09:01:30 67        x        
 7 A_2   2017-11-26 09:05:00 NA                  NA        NA       
 8 A_2   2017-11-30 09:00:00 NA                  NA        NA       
 9 A_2   2017-11-30 09:05:00 2017-11-30 09:06:40 30        y        
10 A_2   2017-12-02 09:00:00 NA                  NA        NA       
11 B_1   2017-11-26 09:00:00 NA                  NA        NA       
12 B_1   2017-11-26 09:05:00 NA                  NA        NA       
13 B_1   2017-11-30 09:00:00 2017-11-30 09:04:50 28        z        
14 B_1   2017-11-30 09:05:00 NA                  NA        NA       
15 B_1   2017-12-02 09:00:00 2017-12-02 09:01:00 90        w        
16 B_2   2017-11-26 09:00:00 NA                  NA        NA       
17 B_2   2017-11-26 09:05:00 NA                  NA        NA       
18 B_2   2017-11-30 09:00:00 2017-11-30 09:04:50 28        z        
19 B_2   2017-11-30 09:05:00 NA                  NA        NA       
20 B_2   2017-12-02 09:00:00 2017-12-02 09:01:00 90        w 

Note: not every row in dataset 1 will have a match in dataset 2, so those rows must be filled with NAs.

Thanks in advance.


Solution

  • We may use floor_date and ceiling_date from lubridate to turn each date in df2 into a 5-minute interval, then do a non-equi join with data.table on the id prefix and that interval.

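    library(lubridate)
    library(dplyr)
    library(data.table)
    # Bracket each df2 timestamp by the 5-minute marks around it:
    # date becomes the lower bound, date2 the upper bound
    df2new <- df2 %>%
       mutate(date2 = ceiling_date(date, "5 min"), 
              date = floor_date(date, "5 min"))
    # Strip the "_<number>" suffix so "A_1" matches "A", then update df by
    # reference where lower bound <= df$date < upper bound, so that each df2
    # row attaches to the preceding 5-minute mark as in the expected output
    setDT(df)[, id2 := trimws(id, whitespace = "_\\d+")][
       setDT(df2new), c('date2', 'variable1', 'variable2') := .(date2,  
        variable1, variable2), on = .(id2 = id, date >= date, date < date2)]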
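
If every timestamp in df sits exactly on a 5-minute mark, as in the example data, a plain equi-join in dplyr may be enough: floor each df2 timestamp to its 5-minute mark and join on that. The sketch below is an alternative under that assumption, not part of the answer above. It uses small stand-in data, and the names `df2_floor` and `df_output` are illustrative. Here date2 keeps the original df2 timestamp, as in the expected output.

```r
library(dplyr)
library(lubridate)

# Small stand-ins for the question's df and df2
df <- tibble(
  id   = c("A_1", "A_1", "B_1", "B_1"),
  date = ymd_hms(c("2017-11-26 09:00:00", "2017-11-26 09:05:00",
                   "2017-11-30 09:00:00", "2017-11-30 09:05:00"))
)
df2 <- tibble(
  id        = c("A", "B"),
  date      = ymd_hms(c("2017-11-26 09:01:30", "2017-11-30 09:04:50")),
  variable1 = c("67", "28"),
  variable2 = c("x", "z")
)

# Keep the original timestamp as date2, then floor date to the 5-minute mark
# (assumption: df timestamps always fall exactly on such marks)
df2_floor <- df2 %>%
  mutate(date2 = date, date = floor_date(date, "5 min")) %>%
  rename(id2 = id)

# Derive the id prefix in df, join on prefix and floored time, drop the helper
df_output <- df %>%
  mutate(id2 = sub("_\\d+$", "", id)) %>%
  left_join(df2_floor, by = c("id2", "date")) %>%
  select(-id2)

df_output
```

Rows of df without a match keep NA in date2, variable1 and variable2, because left_join retains every row of its first argument.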