Search code examples
rdplyrtidyverserefactoringsummarize

How do I dplyr `summarize` and/or `mutate` over a list of parameter values, without an external `for` loop or generating extra columns?


I've been asked to generate something like this:

library("tidyverse")
df <- tibble(
  comparison_var = c(1:10), 
  var_to_sum = runif(10, max = 10), 
  other_var = runif(10, max = 50)
)

summary_tbl <- df |>
  summarize(
    other_summary = mean(other_var),
    sum_gt2 = sum((comparison_var > 2) * var_to_sum),
    sum_gt3 = sum((comparison_var > 3) * var_to_sum),
    #...,
    sum_gtN = sum((comparison_var > 9) * var_to_sum)
  )

Without repeating myself. In words, generate an arbitrary number of columns calculated with a (single) simple varying parameter value. Technically, I already know some ways to get this done:

  • I could create N columns, e.g. dummies indicating (comparison_var > n), then across(starts_with(...), ...).
    • Even this begs the question, though, of how to generate those without sticking mutate inside a for loop or apply/map call? I.e. can the looping be done, in pipe chain, inside mutate(...)?
  • I could put summarize() within a loop, for (cutoff in 1:N) {...}, I guess _join()ing the results onto the existing summary table.
  • I could, presumably, again create columns with each parameter value, or probably more sophisticatedly one list column with a list of them, then again use across(), or some maybe some map/apply thing that returns a dataframe.

Some of these are not horribly inelegant, but it really feels like this should be something super basic. It's the exact same idea as across(), except across an external vector rather than a number of columns. In fact, every time I try to google for it, all I find are questions about how to use across.

I would really like something that doesn't involve generating extra columns, or looping, or really, ideally, leaving a pipe chain.

I did notice that a value of summarize() or mutate() can be a tibble itself, generating more than one column at once. That led to writing this:

worldpop_educ_splits <- merged_tbl |>
  summarize(
    across(
      starts_with("yr_sch_gt"),
      ~ sum(.x * pop, na.rm = TRUE) / sum(pop, na.rm = TRUE),
      .names = "fracwt_{.col}"
    ),
    {
      tbl <- tibble(.rows = 1)
      for (cutoff in 2:9) {
        tbl[[1, glue("fracwt_yrsch_gt{cutoff}")]] <-
          sum((yr_sch > cutoff) * pop, na.rm = TRUE) / sum(pop, na.rm = TRUE)
      }
      tbl
    },
    .by = year
  )

Which technically works, but is abhorrent, and, as you can guess, insanely slow. I left the the first across() call in the example as it is the same calculation that I want to do in the 2nd one, but for a variable which I already had an indicator for each cutoff level. I just would prefer not to make an indicator for this one too, and would like to stop doing so for the first as well.

Yes, I can just mutate up some columns, then drop them later or let them disappear with summarize(). It just feels like there has to be a very obvious, elegant way to do this that I'm missing.

Thanks much, in advance, for your input!


Solution

  • I'd also go with the summrise-function-returning-a-dataframe route, but I think you are overthinking it a bit. This should do, works with groups as well:

    library(tidyverse)
    
    df |>
      summarize(
        other_summary = mean(other_var),
        sum_gt = 2:9 |> 
          set_names() |> 
          map(\(gt) sum(var_to_sum[comparison_var > gt])) |> 
          bind_cols(),
      ) |> 
      unnest_wider(sum_gt, names_sep = "")
    #> # A tibble: 1 × 9
    #>   other_summary sum_gt2 sum_gt3 sum_gt4 sum_gt5 sum_gt6 sum_gt7 sum_gt8 sum_gt9
    #>           <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
    #> 1          29.5    45.1    42.2    33.9    27.5    22.3    15.0    13.6    7.05
    

    Reference:

    df |>
      summarize(
        other_summary = mean(other_var),
        sum_gt2 = sum((comparison_var > 2) * var_to_sum),
        sum_gt3 = sum((comparison_var > 3) * var_to_sum),
        #...,
        sum_gtN = sum((comparison_var > 9) * var_to_sum)
      )
    #> # A tibble: 1 × 4
    #>   other_summary sum_gt2 sum_gt3 sum_gtN
    #>           <dbl>   <dbl>   <dbl>   <dbl>
    #> 1          29.5    45.1    42.2    7.05
    

    Example data:

    set.seed(42)
    df <- tibble(
      comparison_var = c(1:10), 
      var_to_sum = runif(10, max = 10), 
      other_var = runif(10, max = 50)
    )
    df
    #> # A tibble: 10 × 3
    #>    comparison_var var_to_sum other_var
    #>             <int>      <dbl>     <dbl>
    #>  1              1       9.15     22.9 
    #>  2              2       9.37     36.0 
    #>  3              3       2.86     46.7 
    #>  4              4       8.30     12.8 
    #>  5              5       6.42     23.1 
    #>  6              6       5.19     47.0 
    #>  7              7       7.37     48.9 
    #>  8              8       1.35      5.87
    #>  9              9       6.57     23.7 
    #> 10             10       7.05     28.0
    

    bind_cols() can be replaced with list() to avoid some tibble overhead & gain some slight performance boost, but difference is surprisingly small, around 2ms on my machine when testing with 100000 rows.