Search code examples
rsortingdplyrsummarize

dplyr::summarize_at – sort columns by order of variables passed, then by order of functions applied


Problem

By using dplyr::summarize_at() (or equivalent), I would like to get a table of summaries in which columns are sorted first by (G) order of grouping variables used, then by (V) order of variables passed and lastly by (F) order of functions applied. The default order is determined first by G, then by F and lastly by V.

Example

The code:

library(purrr)
library(dplyr)

q025 <- partial(quantile, probs  = 0.025, na.rm = TRUE)
q975 <- partial(quantile, probs  = 0.975, na.rm = TRUE)

vars_to_summarize <- c("height", "mass")

my_summary <- starwars %>% 
    filter(skin_color  %in% c("gold", "green")) %>% 
    group_by(skin_color) %>% 
    summarise_at(vars_to_summarize, funs(q025, mean, q975))

Results in:

my_summary
## A tibble: 2 x 7
##   skin_color height_q025 mass_q025 height_mean mass_mean height_q975 mass_q975
##        <chr>       <dbl>     <dbl>       <dbl>     <dbl>       <dbl>     <dbl>
## 1       gold     167.000      75.0         167        75      167.00      75.0
## 2      green      79.375      22.7         169        NA      204.75     110.4

And the desired order of variables should be:

skin_color, height_q025, height_mean, height_q975, mass_q025, mass_mean, mass_q975

I would like to use something like this (naively simple) code:

my_summary  %>% 
    select(everything(), starts_with(vars_to_summarize))

But it does not work. Even this code does not work as I expect (even though it's not a general solution I seek):

my_summary  %>% 
    select(everything(),
           starts_with(vars_to_summarize[1]),
           starts_with(vars_to_summarize[2]))

Most probably everything() should always be the last argument in select().

To generalize

Say, I have:

  1. N grouping variables ("gr_") that I pass to group_by(),
  2. L variables ("var_") that have to be summarized and
  3. M summary functions to apply ("fun_").

In general, the desired order of variables in the summary table should follow the pattern:

gr_1, gr_2, ..., gr_N,   
var_1_fun_1, var_1_fun_2, ..., var_1_fun_M,  
var_2_fun_1, var_2_fun_2, ..., var_2_fun_M, 
...,
var_L_fun_1, var_L_fun_2, ..., var_L_fun_M

Solution

  • We can use matches and grep

    my_summary %>%
        select(grep(paste(vars_to_summarize, collapse="|"), names(.), invert = TRUE), 
               matches(vars_to_summarize[1]),
               matches(vars_to_summarize[2]))
    # A tibble: 2 x 7
    #    skin_color height_q025 height_mean height_q975 mass_q025 mass_mean mass_q975
    #       <chr>       <dbl>       <dbl>       <dbl>     <dbl>     <dbl>     <dbl>
    #1       gold     167.000         167      167.00      75.0        75      75.0
    #2      green      79.375         169      204.75      22.7        NA     110.4
    

    If there are many columns, then another option would be to remove the substring from _ in the column names, match with the 'vars_to_summarize' and order within the select

    my_summary %>% 
       select(order(match(sub("_.*", "", names(.)), vars_to_summarize, nomatch = 0)))
    # A tibble: 2 x 7
    #    skin_color height_q025 height_mean height_q975 mass_q025 mass_mean mass_q975
    #       <chr>       <dbl>       <dbl>       <dbl>     <dbl>     <dbl>     <dbl>
    #1       gold     167.000         167      167.00      75.0        75      75.0
    #2      green      79.375         169      204.75      22.7        NA     110.4