How to use dplyr's summarize function to summarize specific columns using a list of functions


Problem statement

I am writing a function that aggregates (rolls up) data from short time intervals to longer time intervals. I am currently using dplyr and lubridate to accomplish this.

The input dataframe to my function has a time column and one or more independent variable columns. I want the user to be able to specify which functions are used to aggregate the data. This is an argument to my function called variable_aggregation, which is a list of functions like c(sum, median, mean, sd) (calling c() on functions returns a list). The number of functions should match the number of independent variable columns in the input dataframe, and the functions should be in the same order as the columns they apply to.

The intention is that each function is only applied to the column it matches the order of. For example, if we have four independent variable columns, Ozone, Solar.R, Wind, and Temp, then the list of functions c(sum, median, mean, sd) would have Ozone aggregated by sum, Solar.R by median, Wind by mean, and Temp by standard deviation.
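To make the intended pairing concrete, here is a minimal base R sketch. It assumes the built-in airquality dataset and ignores the time grouping entirely; the names pairing and totals are made up for illustration.

```r
# Column names and functions, matched by position (assumed to be the same length)
variable_names <- c("Ozone", "Solar.R", "Wind", "Temp")
variable_aggregation <- c(sum, median, mean, sd)  # c() on functions yields a list

stopifnot(length(variable_aggregation) == length(variable_names))

# Name each function after the column it belongs to
pairing <- setNames(variable_aggregation, variable_names)

# Apply each function only to its own column (whole dataset, no grouping)
totals <- sapply(variable_names,
                 function(nm) pairing[[nm]](airquality[[nm]], na.rm = TRUE))
print(totals)
```

Each entry of totals comes from exactly one function and one column, which is the one-to-one behavior I want per time group.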

This is where I'm running into an issue. When dplyr's summarize function is given a set of columns and a list of functions, it applies every function to every column, so n columns and n functions produce n^2 output columns instead of n. For the example above, this means sum would be applied to all 4 columns, as would median, mean, and sd, resulting in 16 columns.
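The blow-up can be seen in a small sketch. It assumes the built-in airquality dataset and groups by Month rather than by hour, purely to keep the example short.

```r
library(dplyr)

# Four columns crossed with four functions: every function hits every column
crossed <- airquality %>%
  group_by(Month) %>%
  summarize(across(c(Ozone, Solar.R, Wind, Temp),
                   list(sum = sum, median = median, mean = mean, sd = sd)))

# One grouping column plus 4 x 4 summary columns
ncol(crossed)
```

Only 4 of those 16 summary columns are actually wanted.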

Attempted solution

So far I have lived with this and just trimmed out the extra columns, but this function is going to be used on very large datasets (down to minute-by-minute interval data over multiple years) with a potentially large number of independent variables. I'm worried about the performance impact all these unnecessary computations will have.

Here is a simple example of my current code:

aggregate_func <- function (independent_variables, variable_aggregation){
  
  variable_names <- colnames(independent_variables[, -1])
  
  hourly_independent_variables <- independent_variables %>%
    dplyr::mutate(hour = lubridate::floor_date(time, "hour")) %>%
    dplyr::group_by("time" = hour) %>%
    dplyr::summarize_at(.vars = variable_names,
                        .funs = variable_aggregation,
                        na.rm = TRUE)
  
  # Set column names and delete extraneous columns created by the summarize function
  variable_aggregation_indices <- seq(from = 2,
                                      to = length(variable_names)^2 + 1,
                                      by = length(variable_names) + 1)
  
  names(hourly_independent_variables)[variable_aggregation_indices] <- variable_names
  hourly_independent_variables <- hourly_independent_variables[, c("time", variable_names)]
  
  return(hourly_independent_variables)
  
}

I know this can equivalently be done with a combination of summarize and across, as shown below, but it suffers from the same problem of generating too many columns. I also lose the ability to specify na.rm = TRUE as a separate argument, since passing extra arguments through across is deprecated.

  hourly_independent_variables <- independent_variables %>%
    dplyr::mutate(hour = lubridate::floor_date(time, "hour")) %>%
    dplyr::group_by("time" = hour) %>%
    dplyr::summarize(dplyr::across(.cols = dplyr::all_of(variable_names),
                                   .fns = variable_aggregation))
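One possible way to keep na.rm = TRUE with across is to wrap each function so the argument is baked in before across sees it. The sketch below assumes the built-in airquality dataset grouped by Month; with_na_rm is a hypothetical helper, not part of dplyr. It only addresses the na.rm point and still crosses every function with every column.

```r
library(dplyr)

# Hypothetical helper: return a version of f that always drops NAs
with_na_rm <- function(f) {
  force(f)
  function(x) f(x, na.rm = TRUE)
}

agg_funcs <- list(sum = sum, median = median, mean = mean, sd = sd)
agg_funcs_na_rm <- lapply(agg_funcs, with_na_rm)

# No extra arguments are passed through across(); na.rm lives inside each wrapper
res <- airquality %>%
  group_by(Month) %>%
  summarize(across(c(Ozone, Wind), agg_funcs_na_rm))
```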

Reproducible inputs

Here are some inputs and reproducible code to test with. I'm using the built-in R dataset "airquality".

library(dplyr)
library(lubridate)

ind_vars <- force(airquality)
time_series <- seq.POSIXt(from = lubridate::ymd_hms("2023-01-01 00:00:00", tz = "UTC"),
                          to = lubridate::ymd_hms("2023-01-02 14:00:00", tz = "UTC"),
                          by = lubridate::as.duration("15 mins"), tz = "UTC")

ind_vars <- ind_vars %>%
  dplyr::mutate(time = time_series, .before = Ozone) %>%
  dplyr::select(-c("Month", "Day"))

agg_funcs <- c(sum, median, mean, sd)

hourly_vars <- aggregate_func(ind_vars, agg_funcs)

Here's what the first 10 rows of the inputs look like:

                   time Ozone Solar.R Wind Temp
1   2023-01-01 00:00:00    41     190  7.4   67
2   2023-01-01 00:15:00    36     118  8.0   72
3   2023-01-01 00:30:00    12     149 12.6   74
4   2023-01-01 00:45:00    18     313 11.5   62
5   2023-01-01 01:00:00    NA      NA 14.3   56
6   2023-01-01 01:15:00    28      NA 14.9   66
7   2023-01-01 01:30:00    23     299  8.6   65
8   2023-01-01 01:45:00    19      99 13.8   59
9   2023-01-01 02:00:00     8      19 20.1   61
10  2023-01-01 02:15:00    NA     194  8.6   69

And here's what the first 10 rows of the outputs should look like:


   time                Ozone Solar.R  Wind  Temp
   <dttm>              <int>   <dbl> <dbl> <dbl>
 1 2023-01-01 00:00:00   107   170.   9.88  5.38
 2 2023-01-01 01:00:00    70   199   12.9   4.80
 3 2023-01-01 02:00:00    31   194   11.3   5.38
 4 2023-01-01 03:00:00    57   282   11.2   4.32
 5 2023-01-01 04:00:00    81   192.  12.9   4.86
 6 2023-01-01 05:00:00    48    58.5 12     6.40
 7 2023-01-01 06:00:00    23    66   12.9   4.86
 8 2023-01-01 07:00:00   197   266.   9.15  2.08
 9 2023-01-01 08:00:00     0   231   10.9   8.58
10 2023-01-01 09:00:00   100   268.  11.2   4.93

Any suggestions are greatly appreciated. I'm trying to avoid using for loops to keep this code compact and readable.

Update: final working implementation

Thanks to deschen for the solution using map2 from purrr to wrap the summarize call. Here's what my final function looks like after implementing their framework. I'm using summarize_at instead of summarize with across because passing na.rm = TRUE as a separate argument to across is deprecated.

aggregate_func <- function (independent_variables, variable_aggregation){

  variable_names <- colnames(independent_variables[, -1])

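  # Summarize one column per function, then join the one-column results on time.
  # purrr is referenced by namespace because only dplyr and lubridate are attached above.
  hourly_independent_variables <- purrr::map2(.x = variable_names,
                                              .y = variable_aggregation,
                                              .f = ~independent_variables %>%
                                                dplyr::mutate(hour = lubridate::floor_date(time, "hour")) %>%
                                                dplyr::group_by("time" = hour) %>%
                                                dplyr::summarize_at(.vars = .x,
                                                                    .funs = .y,
                                                                    na.rm = TRUE)) %>%
    purrr::reduce(dplyr::left_join, by = 'time')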

  return(hourly_independent_variables)

}
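Since performance on large data was the original concern, a variant worth considering is to build the grouped data once and reuse it for each column, rather than repeating the mutate and group_by steps inside every map2 iteration. This is only a sketch under assumptions: it uses the built-in airquality dataset grouped by Month instead of hourly time stamps, and the names grouped, cols, and fns are illustrative.

```r
library(dplyr)
library(purrr)

cols <- c("Ozone", "Solar.R", "Wind", "Temp")
fns  <- list(sum, median, mean, sd)

# Group once, outside the loop over columns
grouped <- airquality %>% group_by(Month)

res <- map2(cols, fns,
            function(col, f) {
              # "{col}" := names the output after the input column
              summarize(grouped, "{col}" := f(.data[[col]], na.rm = TRUE))
            }) %>%
  reduce(left_join, by = "Month")
```

I have not benchmarked this against the function above, so any speed difference is untested.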

Solution

  • Here's an example using map2, which is effectively a for loop. We define the columns in .x (change that to your liking) and the functions in .y.

    In the across function you can play around with the column naming.

    library(tidyverse)
    
    map2(.x = 1:4,
         .y = c('sum', 'median', 'mean', 'sd'),
         .f = ~airquality %>%
                 group_by(Month) %>%
                 summarize(across(all_of(.x),
                                  list(.y),
                                  na.rm = TRUE,
                                  .names = '{.y}_{.col}'))) %>%
      reduce(left_join, by = 'Month')
    
     # A tibble: 5 x 5
      Month sum_Ozone median_Solar.R mean_Wind sd_Temp
      <int>     <int>          <dbl>     <dbl>   <dbl>
    1     5       614           194      11.6     6.85
    2     6       265           188.     10.3     6.60
    3     7      1537           253       8.94    4.32
    4     8      1559           198.      8.79    6.59
    5     9       912           192      10.2     8.36
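As a possible alternative that avoids the joins, the functions can be stored in a list named by column and looked up with dplyr's cur_column() inside a single across call. This is a sketch rather than part of the answer above; it assumes the same airquality data grouped by Month, and cols, fns, and fn_by_col are illustrative names.

```r
library(dplyr)

cols <- c("Ozone", "Solar.R", "Wind", "Temp")
fns  <- list(sum, median, mean, sd)

# Look-up table: column name -> the one function that should summarize it
fn_by_col <- setNames(fns, cols)

result <- airquality %>%
  group_by(Month) %>%
  summarize(across(all_of(cols),
                   function(x) fn_by_col[[cur_column()]](x, na.rm = TRUE)))
```

Each column is summarized by exactly one function in a single summarize call, and the output keeps the original column names.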