The problem: I have two datasets that are in essence time sequences with something measured at each timepoint. Df1 has a sample point every 30 mins, while df2 has a sample point every 5 min. I want to join the two datasets but to do that I need to summarise df2 so it also has only 1 value per 30 min. However, these intervals are not always constant and therefore this is the solution I've come up to, but I need to then be able to do this for every row of df1. I think there are purrr
functions that will allow me to map a custom function row wise, but I haven't quite figured out how to.
This is what I've done so far
0 - Packages
library(lubridate)
library(tidyverse)
library(wakefield)
library(magrittr)
1 - Generate dummy datasets for example
df1 <- data.frame(date_time = format(seq(as.POSIXct("2021-01-01 00:00:00", tz="GMT"),
as.POSIXct("2021-01-01 02:30:00", tz="GMT"),
by='30 min'), "%d-%m-%Y %H:%M:%S"),
other_vars = upper(6, k = 5, x = LETTERS, prob = NULL, name = "Upper"))
df2 <- data.frame(date_time = format(seq(as.POSIXct("2021-01-01 00:00:00", tz="GMT"),
as.POSIXct("2021-01-01 02:30:00", tz="GMT"),
by='5 min'), "%d-%m-%Y %H:%M:%S"),
num_var1 = runif(31, 0, 255),
num_var2 = runif(31, 0, 255))
2 - Add time at row i+1 as new column in df1
df1 %<>%
mutate(date_time_lead = lead(date_time))
3 - Subset rows of df2 with time points between time i and i+1 (in this case i == 1) of df1 and summarise the values (mean, sd, sum)
df2_sub <- df2 %>%
filter(date_time >= df1$date_time[1] & date_time < df1$date_time_lead[1]) %>%
summarise(var1_mean = mean(num_var1, na.rm = T),
var1_sd = sd(num_var1, na.rm = T),
var1_sum = sum(num_var1, na.rm = T),
var2_mean = mean(num_var2, na.rm = T),
var2_sd = sd(num_var2, na.rm = T),
var2_sum = sum(num_var2, na.rm = T))
Now, ideally I'd put all of this in a custom function and then use one of the purrr
functions to apply that function to each row of df1, but I haven't managed to figure out how to, so this is the next step
4 - Generate new columns in df1
df1$var1_mean <- NA
df1$var1_sd <- NA
df1$var1_sum <- NA
df1$var2_mean <- NA
df1$var2_sd <- NA
df1$var2_sum <- NA
5 - Add the newly generated summarised values to the ith row (in this case the first) of df1
df1[1, 4:9] <- df2_sub[1,]
Are you looking for something like this?
Hard to tell if it is correct, since your sample data is not minimal, and I do not get the use of logical operators in the line:
date_time > df1$date_time[1] & date_time < df1$date_time_lead[1])
Which suggests that in the interval 0:00-0:30 you want to exclude the values of 0:00 and 0:30 ?
library(data.table)
# set to data.table format
setDT(df1)
setDT(df2)
# perform rolling join, to roll down to the nearest date_from
df2[, c("date_time_df1", "other_vars") := df1[df2, .(x.date_time, x.other_vars), on = .(date_time), roll = Inf]]
# melt to long
df2.melt <- melt(df2, measure.vars = patterns("^num_var"))
# summarise
df2.melt[, .(mean = mean(value, na.rm = TRUE),
sd = sd(value, na.rm = TRUE),
sum = sum(value, na.rm = TRUE)),
by = .(date_time = date_time_df1, other_vars, variable)]
# date_time other_vars variable mean sd sum
# 1: 01-01-2021 00:00:00 B num_var1 183.51225 57.36997 1101.0735
# 2: 01-01-2021 00:30:00 B num_var1 96.46239 69.12586 578.7744
# 3: 01-01-2021 01:00:00 D num_var1 163.74048 84.57741 982.4429
# 4: 01-01-2021 01:30:00 A num_var1 120.74330 41.14782 724.4598
# 5: 01-01-2021 02:00:00 E num_var1 109.74861 85.56809 658.4917
# ---
# 34942: 30-12-2021 22:00:00 E num_var2 140.67153 79.74211 844.0292
# 34943: 30-12-2021 22:30:00 B num_var2 101.25896 75.61385 607.5538
# 34944: 30-12-2021 23:00:00 E num_var2 77.86649 50.45125 467.1989
# 34945: 30-12-2021 23:30:00 D num_var2 152.96687 66.00130 917.8012
# 34946: 31-12-2021 00:00:00 E num_var2 245.89731 NA 245.8973
tidyverse solution
using dplyr >= v1.1.0
df2 %>%
left_join(df1, join_by(closest(date_time >= date_time))) %>%
group_by(date_time = date_time.y) %>%
summarise(across(starts_with("num_"), list(mean = mean, sd = sd, sum = sum)))
# A tibble: 6 × 7
# date_time num_var1_mean num_var1_sd num_var1_sum num_var2_mean num_var2_sd num_var2_sum
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 01-01-2021 00:00:00 149. 88.6 896. 116. 91.5 697.
# 2 01-01-2021 00:30:00 122. 60.8 731. 71.8 71.8 431.
# 3 01-01-2021 01:00:00 114. 86.5 681. 165. 75.9 988.
# 4 01-01-2021 01:30:00 187. 61.9 1120. 156. 81.2 936.
# 5 01-01-2021 02:00:00 87.2 86.2 523. 130. 66.9 781.
# 6 01-01-2021 02:30:00 125. NA 125. 251. NA 251.