A typical reason to duplicate rows against an external vector of length greater than one is to introduce new dates, or to make every individual appear under each date.
Imagine that we wanted to create measurements for each month in the iris dataset. One option would be to do it this way:
library(dplyr)

# Group by every column, then expand each group over the monthly dates.
group_and_tidyr_expand <- function(df) {
  df %>%
    group_by(pick(everything())) %>%
    tidyr::expand(date = seq.Date(from = as.Date("2022-01-01"), to = as.Date("2023-01-01"), by = "1 month")) %>%
    ungroup()
}
However, dplyr (since version 1.1.0) has a function, reframe(), that can return more than one value (i.e., output row) per group. The equivalent of the code above would be:
group_and_reframe <- function(df) {
  df %>%
    group_by(pick(everything())) %>%
    reframe(date = seq.Date(from = as.Date("2022-01-01"), to = as.Date("2023-01-01"), by = "1 month"))
}
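Before comparing speed, it is worth checking that the two approaches really return the same thing. The following is a minimal sketch, not part of the original benchmark: the names dates, via_expand and via_reframe are introduced here only for illustration, and it assumes dplyr 1.1.0 or later and tidyr are installed. Note that rows which are identical in every column fall into the same group, so they share a single set of dates.

```r
library(dplyr)

# Monthly dates used by both approaches (illustrative helper variable).
dates <- seq.Date(from = as.Date("2022-01-01"), to = as.Date("2023-01-01"), by = "1 month")

via_expand <- iris %>%
  group_by(pick(everything())) %>%
  tidyr::expand(date = dates) %>%
  ungroup()

via_reframe <- iris %>%
  group_by(pick(everything())) %>%
  reframe(date = dates)

# Both results should have the same number of rows and columns.
dim(via_expand)
dim(via_reframe)

# reframe() returns an ungrouped data frame, so this should be FALSE.
is_grouped_df(via_reframe)

# One block of dates per distinct input row, not per original row.
nrow(via_reframe) == nrow(distinct(iris)) * length(dates)
```

If the input may contain fully duplicated rows and each copy should get its own dates, a row identifier would need to be added before grouping.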
Which of these two alternatives is faster?
Note: There is no need to call ungroup() after reframe(), as it always returns an ungrouped data frame.
After benchmarking, it looks like reframe could be roughly three and a half times faster than tidyr::expand on this example (a median of about 70 ms versus about 243 ms).
microbenchmark::microbenchmark(
  "reframe" = group_and_reframe(iris),
  "tidyr_expand" = group_and_tidyr_expand(iris)
)
#> Unit: milliseconds
#>          expr      min        lq      mean   median       uq      max neval
#>       reframe  59.8355  63.90195  74.08853  70.2310  79.2985 219.7671   100
#>  tidyr_expand 207.6898 227.03515 261.62132 243.3526 269.4305 517.4461   100
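Rather than reading the ratio off the table, it can be computed from the benchmark object. With the medians reported above (243.35 ms and 70.23 ms) it works out to roughly 3.5. The sketch below is one way to do this, under the assumption that microbenchmark, dplyr 1.1.0 or later and tidyr are installed; the names bm, res, medians and speed_ratio are illustrative, and times = 20 is chosen only to keep the run short. Absolute timings will differ between machines.

```r
library(dplyr)

dates <- seq.Date(from = as.Date("2022-01-01"), to = as.Date("2023-01-01"), by = "1 month")

# Same two approaches as in the text, repeated so this block runs on its own.
group_and_tidyr_expand <- function(df) {
  df %>%
    group_by(pick(everything())) %>%
    tidyr::expand(date = dates) %>%
    ungroup()
}

group_and_reframe <- function(df) {
  df %>%
    group_by(pick(everything())) %>%
    reframe(date = dates)
}

bm <- microbenchmark::microbenchmark(
  reframe = group_and_reframe(iris),
  tidyr_expand = group_and_tidyr_expand(iris),
  times = 20
)

# summary() gives one row per expression; fixing the unit keeps medians comparable.
res <- summary(bm, unit = "ms")
medians <- setNames(res$median, as.character(res$expr))

# How many times slower tidyr::expand is than reframe at the median.
speed_ratio <- unname(medians["tidyr_expand"] / medians["reframe"])
speed_ratio
```

The median is used here because it is less sensitive than the mean to the occasional slow run visible in the max column.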