r dplyr tidyr

Which alternative is faster for creating new rows based on an external vector in R?


A typical use case for duplicating rows to fit an external vector of length > 1 is introducing new dates, or letting each date show a different individual.

Imagine that we wanted to create measurements for each month in the iris dataset. One option would be to do it this way:

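library(dplyr)  # provides %>%, group_by, pick, everything, ungroup, reframe

# Group by every column, then expand each group to one row per monthly date
group_and_tidyr_expand <- function(df){
  df %>% 
    group_by(pick(everything())) %>% 
    tidyr::expand(date = seq.Date(from = as.Date("2022-01-01"), to = as.Date("2023-01-01"), by = "1 month")) %>% 
    ungroup()
}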

However, dplyr now has a function, reframe, that can return more than one value (i.e., output row) per group. The equivalent of the code above would be:

group_and_reframe <- function(df){
  df %>% 
    group_by(pick(everything())) %>% 
    reframe(date = seq.Date(from = as.Date("2022-01-01"), to = as.Date("2023-01-01"), by = "1 month"))
}

Which of these two alternatives is faster?

Note: There is no need to call ungroup after reframe, as reframe always returns an ungrouped data frame.
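Before timing the two approaches, it may be worth checking that they produce the same rows. The following is a minimal sketch, assuming dplyr 1.1.0 or later (for pick and reframe) and tidyr are installed; the names dates, via_expand, and via_reframe are introduced here for illustration only.

```r
library(dplyr)
library(tidyr)

# Monthly dates shared by both approaches (illustrative helper variable)
dates <- seq.Date(from = as.Date("2022-01-01"), to = as.Date("2023-01-01"), by = "1 month")

# Approach 1: group by every column, expand within each group, then ungroup
via_expand <- iris %>%
  group_by(pick(everything())) %>%
  tidyr::expand(date = dates) %>%
  ungroup()

# Approach 2: group by every column and let reframe return one row per date
via_reframe <- iris %>%
  group_by(pick(everything())) %>%
  reframe(date = dates)

# Compare the shapes, then the contents
dim(via_expand)
dim(via_reframe)
all.equal(as.data.frame(via_expand), as.data.frame(via_reframe))
```

Because both versions group by every column, identical input rows fall into the same group, so the output has one block of dates per distinct row rather than per original row.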


Solution

  • Benchmarking on iris suggests that reframe is about 3.5 times faster than tidyr::expand (a median of roughly 70 ms versus 243 ms over 100 runs).

    microbenchmark::microbenchmark(
        "reframe" = group_and_reframe(iris), 
        "tidyr_expand" = group_and_tidyr_expand(iris)
    )
    
    #> Unit: milliseconds
    #>          expr      min        lq      mean   median       uq      max neval
    #>       reframe  59.8355  63.90195  74.08853  70.2310  79.2985 219.7671   100
    #>  tidyr_expand 207.6898 227.03515 261.62132 243.3526 269.4305 517.4461   100
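The speed-up can be read directly off the table above. As a small sketch, the ratios below are computed from the reported median and mean timings, which are copied by hand into illustrative variables; results on another machine will likely differ.

```r
# Timings in milliseconds, copied from the benchmark output above
median_ms <- c(reframe = 70.2310, tidyr_expand = 243.3526)
mean_ms   <- c(reframe = 74.08853, tidyr_expand = 261.62132)

# How many times slower tidyr::expand is than reframe
ratio_median <- unname(median_ms["tidyr_expand"] / median_ms["reframe"])
ratio_mean   <- unname(mean_ms["tidyr_expand"] / mean_ms["reframe"])

round(c(median = ratio_median, mean = ratio_mean), 2)
```

Both ratios come out close to 3.5, which is where the "about three times faster" reading comes from.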