I am writing a function right now that will aggregate (roll up) data at short time intervals up to longer time intervals. I am currently using dplyr and lubridate to accomplish this.
The input dataframe to my function has a time column and one or more independent variable columns. I want the user to be able to specify which functions are going to be used to aggregate data. This is an argument to my function called variable_aggregation, which is a vector of functions like c(sum, median, mean, sd). The number of functions should match the number of independent variable columns in the input dataframe, and the functions should be in the same order as the columns they apply to.
The intention is that each function is only applied to the column whose position it matches. For example, if we have four independent variable columns, Ozone, Solar.R, Wind, and Temp, then the list of functions c(sum, median, mean, sd) would have Ozone aggregated by sum, Solar.R by median, Wind by mean, and Temp by standard deviation.
This is where I'm running into an issue. When dplyr's summarize function is provided a list of variables and functions, it will apply every function to every column, squaring the total number of output columns. For the example above, this means sum would be applied to all 4 columns, as would median, mean, and sd, resulting in 16 columns.
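To make the blow-up concrete, here is a minimal sketch on the built-in airquality data, grouped by Month rather than by hour purely to keep it short. The object name wide is only illustrative.

```r
library(dplyr)

# Every function in the list is applied to every selected column.
wide <- airquality %>%
  group_by(Month) %>%
  summarize(across(
    c(Ozone, Solar.R, Wind, Temp),
    list(sum = sum, median = median, mean = mean, sd = sd)
  ))

ncol(wide)   # 17: one grouping column plus 4 x 4 summary columns
names(wide)  # Ozone_sum, Ozone_median, ..., Temp_sd
```

Only four of those sixteen summary columns are actually wanted.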
So far I have lived with this and just trimmed out the extra columns, but this function is going to be used on very large datasets (down to minute-by-minute interval data over multiple years) with a potentially large number of independent variables. I'm worried about the performance impact all these unnecessary computations will have.
Here is a simple example of my current code:
aggregate_func <- function(independent_variables, variable_aggregation) {
  variable_names <- colnames(independent_variables[, -1])
  hourly_independent_variables <- independent_variables %>%
    dplyr::mutate(hour = lubridate::floor_date(time, "hour")) %>%
    dplyr::group_by("time" = hour) %>%
    dplyr::summarize_at(.vars = variable_names,
                        .funs = variable_aggregation,
                        na.rm = TRUE)
  # Set column names and delete extraneous columns created by the summarize function
  variable_aggregation_indices <- seq(from = 2,
                                      to = length(variable_names)^2 + 1,
                                      by = length(variable_names) + 1)
  names(hourly_independent_variables)[variable_aggregation_indices] <- variable_names
  hourly_independent_variables <- hourly_independent_variables[, c("time", variable_names)]
  return(hourly_independent_variables)
}
I know this can equivalently be done with a combination of summarize and across, as shown below, but it still suffers from the same problem of generating too many columns. Importantly, I also lose the ability to specify na.rm = TRUE, since passing extra arguments that way has been deprecated in the across function.
hourly_independent_variables <- independent_variables %>%
  dplyr::mutate(hour = lubridate::floor_date(time, "hour")) %>%
  dplyr::group_by("time" = hour) %>%
  dplyr::summarize(dplyr::across(.cols = dplyr::all_of(variable_names),
                                 .fns = variable_aggregation))
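On the na.rm point: the argument can still reach each function if it is baked into a wrapper before the list is handed to across. The sketch below assumes every supplied function accepts na.rm; the names agg_funcs and with_na_rm are illustrative, and Month is used as the grouping column only to keep the example short. It does not solve the extra-columns problem.

```r
library(dplyr)

agg_funcs <- list(sum = sum, median = median, mean = mean, sd = sd)

# Wrap each function so that na.rm = TRUE is always supplied.
with_na_rm <- lapply(agg_funcs, function(f) {
  force(f)
  function(x) f(x, na.rm = TRUE)
})

res <- airquality %>%
  group_by(Month) %>%
  summarize(across(c(Ozone, Solar.R, Wind, Temp), with_na_rm))
```

The result still has all sixteen function-by-column combinations, but none of them is NA because of missing input values.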
Here are some inputs and reproducible code to test with. I'm using the built-in R dataset "airquality":
library(dplyr)
library(lubridate)
ind_vars <- force(airquality)
time_series <- seq.POSIXt(from = lubridate::ymd_hms("2023-01-01 00:00:00", tz = "UTC"),
                          to = lubridate::ymd_hms("2023-01-02 14:00:00", tz = "UTC"),
                          by = lubridate::as.duration("15 mins"), tz = "UTC")
ind_vars <- ind_vars %>%
  dplyr::mutate(time = time_series, .before = Ozone) %>%
  dplyr::select(-c("Month", "Day"))
agg_funcs <- c(sum, median, mean, sd)
hourly_vars <- aggregate_func(ind_vars, agg_funcs)
Here's what the first 10 rows of the inputs look like:
time Ozone Solar.R Wind Temp
1 2023-01-01 00:00:00 41 190 7.4 67
2 2023-01-01 00:15:00 36 118 8.0 72
3 2023-01-01 00:30:00 12 149 12.6 74
4 2023-01-01 00:45:00 18 313 11.5 62
5 2023-01-01 01:00:00 NA NA 14.3 56
6 2023-01-01 01:15:00 28 NA 14.9 66
7 2023-01-01 01:30:00 23 299 8.6 65
8 2023-01-01 01:45:00 19 99 13.8 59
9 2023-01-01 02:00:00 8 19 20.1 61
10 2023-01-01 02:15:00 NA 194 8.6 69
And here's what the first 10 rows of the outputs should look like:
time Ozone Solar.R Wind Temp
<dttm> <int> <dbl> <dbl> <dbl>
1 2023-01-01 00:00:00 107 170. 9.88 5.38
2 2023-01-01 01:00:00 70 199 12.9 4.80
3 2023-01-01 02:00:00 31 194 11.3 5.38
4 2023-01-01 03:00:00 57 282 11.2 4.32
5 2023-01-01 04:00:00 81 192. 12.9 4.86
6 2023-01-01 05:00:00 48 58.5 12 6.40
7 2023-01-01 06:00:00 23 66 12.9 4.86
8 2023-01-01 07:00:00 197 266. 9.15 2.08
9 2023-01-01 08:00:00 0 231 10.9 8.58
10 2023-01-01 09:00:00 100 268. 11.2 4.93
Any suggestions are greatly appreciated. I'm trying to avoid using for loops to keep this code compact and readable.
Thanks to deschen for the solution using map2 from purrr to wrap the summarize call. Here's what my final function looks like after implementing their framework. I'm choosing to use summarize_at instead of summarize and across because the specification of na.rm = TRUE as a separate argument to the across function is deprecated.
aggregate_func <- function(independent_variables, variable_aggregation) {
  variable_names <- colnames(independent_variables[, -1])
  # Summarize one column with its matching function, then join the pieces on time
  hourly_independent_variables <- purrr::map2(.x = variable_names,
                                              .y = variable_aggregation,
                                              .f = ~independent_variables %>%
                                                dplyr::mutate(hour = lubridate::floor_date(time, "hour")) %>%
                                                dplyr::group_by("time" = hour) %>%
                                                dplyr::summarize_at(.vars = .x,
                                                                    .funs = .y,
                                                                    na.rm = TRUE)) %>%
    purrr::reduce(dplyr::left_join, by = "time")
  return(hourly_independent_variables)
}
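One thing to keep in mind with the function above is that the mutate and group_by steps are repeated once per column. A possible alternative, sketched here under the assumptions that the first column is named time and that every function accepts na.rm, is to name the functions after their columns and look each one up with dplyr::cur_column() inside a single across call. The helper name aggregate_once and the object fns are hypothetical.

```r
library(dplyr)

aggregate_once <- function(independent_variables, variable_aggregation) {
  variable_names <- colnames(independent_variables)[-1]
  stopifnot(length(variable_names) == length(variable_aggregation))

  # Pair each function with the column it belongs to, by position.
  fns <- stats::setNames(variable_aggregation, variable_names)

  independent_variables %>%
    group_by(time = lubridate::floor_date(time, "hour")) %>%
    summarize(
      # cur_column() gives the name of the column being summarized,
      # so only the matching function is ever called.
      across(all_of(variable_names), ~ fns[[cur_column()]](.x, na.rm = TRUE)),
      .groups = "drop"
    )
}

# Usage on 15-minute data built from airquality
ind_vars <- data.frame(
  time = seq(from = as.POSIXct("2023-01-01 00:00:00", tz = "UTC"),
             by = "15 min", length.out = nrow(airquality)),
  airquality[, c("Ozone", "Solar.R", "Wind", "Temp")]
)
hourly_vars <- aggregate_once(ind_vars, c(sum, median, mean, sd))
```

This produces one output column per input column in a single grouped pass, with no join step. Whether it is actually faster on large data would need to be measured.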
Here's an example using map2, which is effectively a for loop under the hood. We define the columns in .x (change that to your liking) and the functions in .y.
In the across function you can play around with the column naming.
library(tidyverse)
map2(.x = 1:4,
     .y = c('sum', 'median', 'mean', 'sd'),
     .f = ~airquality %>%
       group_by(Month) %>%
       summarize(across(all_of(.x),
                        list(.y),
                        na.rm = TRUE,
                        .names = '{.y}_{.col}'))) %>%
  reduce(left_join, by = 'Month')
# A tibble: 5 x 5
Month sum_Ozone median_Solar.R mean_Wind sd_Temp
<int> <int> <dbl> <dbl> <dbl>
1 5 614 194 11.6 6.85
2 6 265 188. 10.3 6.60
3 7 1537 253 8.94 4.32
4 8 1559 198. 8.79 6.59
5 9 912 192 10.2 8.36
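Since passing na.rm = TRUE through across is deprecated in recent dplyr versions, the same idea can also be written without across at all. This is only a sketch: it assumes columns are given by name rather than by position, and the names cols, fn_names, and result are illustrative.

```r
library(dplyr)
library(purrr)

cols <- c("Ozone", "Solar.R", "Wind", "Temp")
fn_names <- c("sum", "median", "mean", "sd")

result <- map2(cols, fn_names, function(col, fn_name) {
  fn <- match.fun(fn_name)  # look the function up from its name
  airquality %>%
    group_by(Month) %>%
    # The output column name is built from the function and column names.
    summarize("{fn_name}_{col}" := fn(.data[[col]], na.rm = TRUE))
}) %>%
  reduce(left_join, by = "Month")
```

The resulting columns are Month, sum_Ozone, median_Solar.R, mean_Wind, and sd_Temp, matching the layout shown above.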