Search code examples
rapplypurrr

Purrr summarising one dataframe and using the result to add columns to another row wise


The problem: I have two datasets that are in essence time sequences with something measured at each timepoint. Df1 has a sample point every 30 mins, while df2 has a sample point every 5 min. I want to join the two datasets but to do that I need to summarise df2 so it also has only 1 value per 30 min. However, these intervals are not always constant and therefore this is the solution I've come up to, but I need to then be able to do this for every row of df1. I think there are purrr functions that will allow me to map a custom function row wise, but I haven't quite figured out how to.

This is what I've done so far

0 - Packages

library(lubridate)
library(tidyverse)
library(wakefield)
library(magrittr)

1 - Generate dummy datasets for example

df1 <- data.frame(date_time = format(seq(as.POSIXct("2021-01-01 00:00:00", tz="GMT"), 
                                     as.POSIXct("2021-01-01 02:30:00", tz="GMT"), 
                                     by='30 min'), "%d-%m-%Y %H:%M:%S"), 
              other_vars = upper(6, k = 5, x = LETTERS, prob = NULL, name = "Upper"))


df2 <- data.frame(date_time = format(seq(as.POSIXct("2021-01-01 00:00:00", tz="GMT"), 
                                     as.POSIXct("2021-01-01 02:30:00", tz="GMT"), 
                                     by='5 min'), "%d-%m-%Y %H:%M:%S"),
              num_var1 = runif(31, 0, 255), 
              num_var2 = runif(31, 0, 255))

2 - Add time at row i+1 as new column in df1

df1 %<>% 
  mutate(date_time_lead = lead(date_time))

3 - Subset rows of df2 with time points between time i and i+1 (in this case i == 1) of df1 and summarise the values (mean, sd, sum)

df2_sub <- df2  %>% 
  filter(date_time >= df1$date_time[1] & date_time < df1$date_time_lead[1]) %>% 
  summarise(var1_mean = mean(num_var1, na.rm = T), 
            var1_sd = sd(num_var1, na.rm = T), 
            var1_sum = sum(num_var1, na.rm = T),
            var2_mean = mean(num_var2, na.rm = T), 
            var2_sd = sd(num_var2, na.rm = T), 
            var2_sum = sum(num_var2, na.rm = T))

Now, ideally I'd put all of this in a custom function and then use one of the purrr functions to apply that function to each row of df1, but I haven't managed to figure out how to, so this is the next step

4 - Generate new columns in df1

df1$var1_mean <- NA
df1$var1_sd <- NA
df1$var1_sum <- NA
df1$var2_mean <- NA
df1$var2_sd <- NA
df1$var2_sum <- NA

5 - Add the newly generated summarised values to the ith row (in this case the first) of df1

df1[1, 4:9] <- df2_sub[1,]

Solution

  • Are you looking for something like this?

    Hard to tell if it is correct, since your sample data is not minimal, and I do not get the use of logical operators in the line:
    date_time > df1$date_time[1] & date_time < df1$date_time_lead[1])
    Which suggests that in the interval 0:00-0:30 you want to exclude the values of 0:00 and 0:30 ?

    library(data.table)
    # set to data.table format
    setDT(df1)
    setDT(df2)
    # perform rolling join, to roll down to the nearest date_from 
    df2[, c("date_time_df1", "other_vars") := df1[df2, .(x.date_time, x.other_vars), on = .(date_time), roll = Inf]]
    # melt to long
    df2.melt <- melt(df2, measure.vars = patterns("^num_var"))
    # summarise
    df2.melt[, .(mean = mean(value, na.rm = TRUE),
                 sd = sd(value, na.rm = TRUE),
                 sum = sum(value, na.rm = TRUE)), 
             by = .(date_time = date_time_df1, other_vars, variable)]
    #                  date_time other_vars variable      mean       sd       sum
    #     1: 01-01-2021 00:00:00          B num_var1 183.51225 57.36997 1101.0735
    #     2: 01-01-2021 00:30:00          B num_var1  96.46239 69.12586  578.7744
    #     3: 01-01-2021 01:00:00          D num_var1 163.74048 84.57741  982.4429
    #     4: 01-01-2021 01:30:00          A num_var1 120.74330 41.14782  724.4598
    #     5: 01-01-2021 02:00:00          E num_var1 109.74861 85.56809  658.4917
    # ---                                                                     
    # 34942: 30-12-2021 22:00:00          E num_var2 140.67153 79.74211  844.0292
    # 34943: 30-12-2021 22:30:00          B num_var2 101.25896 75.61385  607.5538
    # 34944: 30-12-2021 23:00:00          E num_var2  77.86649 50.45125  467.1989
    # 34945: 30-12-2021 23:30:00          D num_var2 152.96687 66.00130  917.8012
    # 34946: 31-12-2021 00:00:00          E num_var2 245.89731       NA  245.8973
    

    tidyverse solution

    using dplyr >= v1.1.0

    df2 %>%
      left_join(df1, join_by(closest(date_time >= date_time))) %>%
      group_by(date_time = date_time.y) %>%
      summarise(across(starts_with("num_"), list(mean = mean, sd = sd, sum = sum)))
    # A tibble: 6 × 7
    #   date_time           num_var1_mean num_var1_sd num_var1_sum num_var2_mean num_var2_sd num_var2_sum
    #   <chr>                       <dbl>       <dbl>        <dbl>         <dbl>       <dbl>        <dbl>
    # 1 01-01-2021 00:00:00         149.         88.6         896.         116.         91.5         697.
    # 2 01-01-2021 00:30:00         122.         60.8         731.          71.8        71.8         431.
    # 3 01-01-2021 01:00:00         114.         86.5         681.         165.         75.9         988.
    # 4 01-01-2021 01:30:00         187.         61.9        1120.         156.         81.2         936.
    # 5 01-01-2021 02:00:00          87.2        86.2         523.         130.         66.9         781.
    # 6 01-01-2021 02:30:00         125.         NA           125.         251.         NA           251.