I have two datasets, exemplified below:
library(lubridate)
library(tidyverse)
#dataset 1
id <- c("A_1", "A_1", "A_1", "A_1", "A_1", "A_2", "A_2", "A_2", "A_2",
"A_2", "B_1", "B_1", "B_1", "B_1", "B_1", "B_2", "B_2", "B_2", "B_2",
"B_2")
date <- ymd_hms(c("2017-11-26 09:00:00", "2017-11-26 09:05:00", "2017-11-30 09:00:00",
"2017-11-30 09:05:00", "2017-12-02 09:00:00", "2017-11-26 09:00:00", "2017-11-26 09:05:00", "2017-11-30 09:00:00",
"2017-11-30 09:05:00", "2017-12-02 09:00:00", "2017-11-26 09:00:00", "2017-11-26 09:05:00", "2017-11-30 09:00:00",
"2017-11-30 09:05:00", "2017-12-02 09:00:00", "2017-11-26 09:00:00", "2017-11-26 09:05:00", "2017-11-30 09:00:00",
"2017-11-30 09:05:00", "2017-12-02 09:00:00"))
df <- tibble(id, date)
# A tibble: 20 x 2
id date
<chr> <dttm>
1 A_1 2017-11-26 09:00:00
2 A_1 2017-11-26 09:05:00
3 A_1 2017-11-30 09:00:00
4 A_1 2017-11-30 09:05:00
5 A_1 2017-12-02 09:00:00
6 A_2 2017-11-26 09:00:00
7 A_2 2017-11-26 09:05:00
8 A_2 2017-11-30 09:00:00
9 A_2 2017-11-30 09:05:00
10 A_2 2017-12-02 09:00:00
11 B_1 2017-11-26 09:00:00
12 B_1 2017-11-26 09:05:00
13 B_1 2017-11-30 09:00:00
14 B_1 2017-11-30 09:05:00
15 B_1 2017-12-02 09:00:00
16 B_2 2017-11-26 09:00:00
17 B_2 2017-11-26 09:05:00
18 B_2 2017-11-30 09:00:00
19 B_2 2017-11-30 09:05:00
20 B_2 2017-12-02 09:00:00
#dataset 2
id <- c("A", "A", "B", "B")
date <- ymd_hms(c("2017-11-26 09:01:30", "2017-11-30 09:06:40", "2017-11-30 09:04:50", "2017-12-02 09:01:00"))
variable1 <- c("67", "30", "28", "90")
variable2 <- c("x","y","z", "w")
df2 <- tibble(id, date, variable1, variable2)
# A tibble: 4 x 4
id date variable1 variable2
<chr> <dttm> <chr> <chr>
1 A 2017-11-26 09:01:30 67 x
2 A 2017-11-30 09:06:40 30 y
3 B 2017-11-30 09:04:50 28 z
4 B 2017-12-02 09:01:00 90 w
I need to match first by "id" and then by date and time: each row of dataset 2 should be attached to the row of dataset 1 with the nearest preceding time (condition: the dataset 1 time is at most 5 min earlier), creating new columns in dataset 1.
However, each "id" in dataset 2 occurs about 50 times in dataset 1, so a row in dataset 2 will probably find a corresponding time about 50 times in dataset 1 for the same date. For each "id", I need this extraction to be done as many times as there is a corresponding time, even if it occurs often.
The resulting dataset would look like this:
df_output
# A tibble: 20 x 5
id date date2 variable1 variable2
<chr> <dttm> <chr> <chr> <chr>
1 A_1 2017-11-26 09:00:00 2017-11-26 09:01:30 67 x
2 A_1 2017-11-26 09:05:00 NA NA NA
3 A_1 2017-11-30 09:00:00 NA NA NA
4 A_1 2017-11-30 09:05:00 2017-11-30 09:06:40 30 y
5 A_1 2017-12-02 09:00:00 NA NA NA
6 A_2 2017-11-26 09:00:00 2017-11-26 09:01:30 67 x
7 A_2 2017-11-26 09:05:00 NA NA NA
8 A_2 2017-11-30 09:00:00 NA NA NA
9 A_2 2017-11-30 09:05:00 2017-11-30 09:06:40 30 y
10 A_2 2017-12-02 09:00:00 NA NA NA
11 B_1 2017-11-26 09:00:00 NA NA NA
12 B_1 2017-11-26 09:05:00 NA NA NA
13 B_1 2017-11-30 09:00:00 2017-11-30 09:04:50 28 z
14 B_1 2017-11-30 09:05:00 NA NA NA
15 B_1 2017-12-02 09:00:00 2017-12-02 09:01:00 90 w
16 B_2 2017-11-26 09:00:00 NA NA NA
17 B_2 2017-11-26 09:05:00 NA NA NA
18 B_2 2017-11-30 09:00:00 2017-11-30 09:04:50 28 z
19 B_2 2017-11-30 09:05:00 NA NA NA
20 B_2 2017-12-02 09:00:00 2017-12-02 09:01:00 90 w
Note: not all rows of dataset 1 will have a corresponding row in dataset 2; these must be filled with NAs.
Thanks in advance.
We may use floor_date and ceiling_date from lubridate to bracket each date in df2 by its 5-minute interval, and then do a non-equi join with data.table:
library(lubridate)
library(dplyr)
library(data.table)
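# Keep the original timestamp and bracket it by its 5-minute interval
df2new <- df2 %>%
  mutate(date_from = floor_date(date, "5 min"),
         date_to = ceiling_date(date, "5 min"))
# Strip the "_<number>" suffix from id, then update df by reference for rows
# whose date falls in [date_from, date_to) of a df2 row with the same id;
# rows without a match stay NA
setDT(df)[, id2 := trimws(id, whitespace = "_\\d+")][
  setDT(df2new), c("date2", "variable1", "variable2") := .(i.date,
    i.variable1, i.variable2),
  on = .(id2 = id, date >= date_from, date < date_to)]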
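The same matching can also be sketched with dplyr alone, as an equi-join on a 5-minute bin instead of a non-equi join. This is only a minimal sketch on a small subset of the question's data, and it assumes that the timestamps in df sit on the 5-minute grid, as they do in the example. The helper columns id2 and bin and the object df2_binned are names introduced here for illustration.

```r
library(dplyr)
library(lubridate)
library(tibble)

# Small subset of the question's data
df <- tibble(
  id = c("A_1", "A_1", "A_2", "B_1"),
  date = ymd_hms(c("2017-11-26 09:00:00", "2017-11-26 09:05:00",
                   "2017-11-26 09:00:00", "2017-11-30 09:00:00"))
)
df2 <- tibble(
  id = c("A", "B"),
  date = ymd_hms(c("2017-11-26 09:01:30", "2017-11-30 09:04:50")),
  variable1 = c("67", "28"),
  variable2 = c("x", "z")
)

# Floor each df2 timestamp to the start of its 5-minute bin and
# rename the columns so they do not clash with those of df
df2_binned <- df2 %>%
  mutate(bin = floor_date(date, "5 min")) %>%
  rename(id2 = id, date2 = date)

# Strip the "_<number>" suffix, bin df the same way, then left-join;
# rows of df without a match keep NA in the new columns
df_output <- df %>%
  mutate(id2 = sub("_\\d+$", "", id),
         bin = floor_date(date, "5 min")) %>%
  left_join(df2_binned, by = c("id2", "bin")) %>%
  select(id, date, date2, variable1, variable2)

df_output
```

Because this is a left join, every row of df2 is repeated for each matching row of df (A_1, A_2, and so on), and date2 keeps the original df2 timestamp. If the timestamps in df are not aligned to the 5-minute grid, a row of df could land in the same bin as a df2 row while being later than it, so the non-equi join is the safer choice in that case.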