I have this dataset exemplified below:
library(lubridate)
library(tidyverse)
id <- c("A", "A", "B", "B")
date <- ymd(c("2017-11-26", "2017-11-26", "2017-11-26", "2017-11-29"))
time <- hms(c("09:25:30", "10:10:02", "09:15:36", "9:50:55"))
variable1 <- c("67", "30", "28", "90")
df <- tibble(id, date, time, variable1)
df
A tibble: 4 x 4
id date time variable1
<chr> <date> <Period> <chr>
1 A 2017-11-26 9H 25M 30S 67
2 A 2017-11-26 10H 10M 2S 30
3 B 2017-11-26 9H 15M 36S 28
4 B 2017-11-29 9H 50M 55S 90
I need to add for each row in this dataset two new variables (latitude and longitude) based on a second dataset, using the columns "id", "date" and "time" as conditions to create the new columns.
However, the "time" is not exactly the same between the two datasets, and thus I need this condition to be an approximation (i.e. the nearest 'time' available for a given 'id' and 'date'). Another issue is that the second dataset has more rows, and thus, by selecting the 'nearest time', some rows will be ignored when adding to the first dataset.
Please see this example for the second dataset:
id <- c("A", "A", "A", "B", "B", "B")
date <- ymd(c("2017-11-26", "2017-11-26", "2017-11-30", "2017-11-26",
"2017-11-26", "2017-11-29"))
time <- hms(c("09:00:00", "10:00:00", "08:00:00", "09:00:00",
"13:00:00", "10:00:00"))
lat <- c(-30.6456, -29.5648, -27.6667, -31.5587, -30.6934, -29.3147)
long <- c(-50.4879, -49.8715, -51.8716, -50.4456, -50.9842, -51.9787)
df2 <- tibble(id, date, time, lat, long)
df2
# A tibble: 6 x 5
id date time lat long
<chr> <date> <Period> <dbl> <dbl>
1 A 2017-11-26 9H 0M 0S -30.6 -50.5
2 A 2017-11-26 10H 0M 0S -29.6 -49.9
3 A 2017-11-30 8H 0M 0S -27.7 -51.9
4 B 2017-11-26 9H 0M 0S -31.6 -50.4
5 B 2017-11-26 13H 0M 0S -30.7 -51.0
6 B 2017-11-29 10H 0M 0S -29.3 -52.0
Therefore, I first need to group by "id", then by "date", and then extract the "lat" and "long" in the dataset 2 for the nearest "time" in the dataset 1, creating the two new columns in the dataset 1.
The resulting dataset would look like this:
df_output
# A tibble: 4 x 6
id date time variable1 lat long
<chr> <date> <Period> <chr> <dbl> <dbl>
1 A 2017-11-26 9H 25M 30S 67 -30.6 -50.5
2 A 2017-11-26 10H 10M 2S 30 -29.6 -49.9
3 B 2017-11-26 9H 15M 36S 28 -31.6 -50.4
4 B 2017-11-29 9H 50M 55S 90 -29.3 -52.0
I've tried with group_by(), map2_dfr(), merge(), left_join(), case_when(), etc.. but couldn't find how to do this.
You may use data.table
like
library(data.table)
df <- df %>%
mutate(time = paste(date, time) %>% ymd_hms()) %>%
select(-date)
df2 <- df2 %>%
mutate(time = paste(date, time) %>% ymd_hms()) %>%
select(-date)
df <- data.table(df, key = c("id", "time"))
df2 <- data.table(df2, key = c("id", "time"))
df2[df, list(id, time, variable1, lat, long), roll = "nearest"]
id time variable1 lat long
1: A 2017-11-26 09:25:30 67 -30.6456 -50.4879
2: A 2017-11-26 10:10:02 30 -29.5648 -49.8715
3: B 2017-11-26 09:15:36 28 -31.5587 -50.4456
4: B 2017-11-29 09:50:55 90 -29.3147 -51.9787
I'm not sure about variable1
's condition, why it's different between df
and df_output
in your question.