Search code examples
rdplyrtidyversetidyrrlang

Turn dplyr group_by/summarize/spread into a function


I have the following data (some organizations, kpi measurements, long list of variables (i gave two in the example).

   df <- tibble::tribble(
  ~ORG_NM, ~KPI_NM,        ~NUMR_VAL,       ~DENO_VAL,
    "AAA",   "xxx",                8,              10,
    "AAA",   "xxx",               10,              10,
    "BBB",   "xxx",                1,               7,
    "CCC",   "xxx",                9,               3,
    "CCC",   "yyy",                9,               4,
    "DDD",   "xxx",                1,               7,
    "AAA",   "yyy",                8,               3,
    "BBB",   "yyy",                6,               1
  )

I would like to summarize each of the variables and generate a wide table so that each organization has only one record. So far my approach was to use repetitive code in which I need to change, - summarise part with a name of a variable to be summarized and a function this variable should be processed with, - spread part - value = change the name of the new column, - rename_at - add meaningful suffix to all spread columns with values, showing what function was used to achieve them. At the end I need to change the name of the dataframe in full_join to append the new columns.

library(tidyverse)    


df_numrtr <- df %>%    
  group_by(ORG_NM, KPI_NM) %>%    
  summarise(mean_NUM_VAL = mean(NUMR_VAL)) %>%    
  spread(key = c(KPI_NM),  mean_NUM_VAL) %>%      
  ungroup() %>%    
  rename_at(vars(-ORG_NM), function(x) paste0(x, "_num_mean"))


df_denom  <- df %>%   
  group_by(ORG_NM, KPI_NM) %>%    
  summarise(mean_DENOM_VAL = mean(DENO_VAL)) %>%    
  spread(key = c(KPI_NM),  mean_DENOM_VAL) %>%      
  ungroup() %>%    
  rename_at(vars(-ORG_NM), function(x)    
    paste0(x, "_den_mean"))



df_final <-    
  df_numrtr %>%     
  full_join(df_denom) %>%     
  select(ORG_NM, sort(names(.))) 

  ORG_NM xxx_den_mean xxx_num_mean yyy_den_mean yyy_num_mean
  <chr>         <dbl>        <dbl>        <dbl>        <dbl>
1 AAA              10            9            3            8
2 BBB               7            1            1            6
3 CCC               3            9            4            9
4 DDD               7            1           NA           NA

i would like to get rid of repetitive code and have a function that will take in a name of a variable and a function. My desired function in pseudocode would look like

fnSummarize <- function(df, my_org_var, my_kpi_var, my_var, my_fun ){
  df_output<-df %>%    
    group_by({{my_groupby_var}}) %>%    
    summarise(paste0({{my_var}},"_",{{my_fun}}) = my_fun({{my_var}})) %>%    
    spread(key = {{my_kpi_var}},  paste0(my_var, my_fun)) %>%      
    ungroup()  %>% 
    rename_at(vars(-{{ my_org_var}}), function(x) paste0(x, {{myfun}}))
  return(df_output)
}

How to properly inject column names and a function to be used in the process (like mean, sum, median, sd) to such a function.


Solution

  • You are quite close. The problem is composition of the column name, which I pulled out into a separate line:

    fnSummarize <- function(df, my_org_var, my_kpi_var, my_var, my_fun ){
      colName <- str_c( rlang::enexpr(my_var),"_",rlang::enexpr(my_fun) )
    
      df %>%
        group_by( {{my_org_var}}, {{my_kpi_var}} ) %>%
        summarise( !!colName := {{my_fun}}({{my_var}}) ) %>%
        spread( key = {{my_kpi_var}}, colName ) %>%
        ungroup() %>%
        rename_at( vars(-{{my_org_var}}), str_c, "_", colName )
    }
    

    The column names are slightly different than your df_numrtr and df_denom, but this can be easily adjusted through additional string manipulation. I left it out to keep things clean.

    fnSummarize( df, ORG_NM, KPI_NM, NUMR_VAL, mean )
    # # A tibble: 4 x 3
    #    ORG_NM xxx_NUMR_VAL_mean yyy_NUMR_VAL_mean
    #    <chr>              <dbl>             <dbl>
    #  1 AAA                    9                 8
    #  2 BBB                    1                 6
    #  3 CCC                    9                 9
    #  4 DDD                    1                NA
    
    ## Demonstrating using sum instead of mean
    fnSummarize( df, ORG_NM, KPI_NM, DENO_VAL, sum )
    #  # A tibble: 4 x 3
    #    ORG_NM xxx_DENO_VAL_sum yyy_DENO_VAL_sum
    #    <chr>             <dbl>            <dbl>
    #  1 AAA                  20                3
    #  2 BBB                   7                1
    #  3 CCC                   3                4
    #  4 DDD                   7               NA
    

    I also want to point out that you can address your task through pure dplyr manipulations, without the need for rlang. For example, here's how you would apply both sum and mean:

    df %>% group_by( ORG_NM, KPI_NM ) %>%
      summarize_at( c("NUMR_VAL", "DENO_VAL"), list(mean=mean,sum=sum) ) %>%
      ungroup() %>% gather( "Variable", "Value", -ORG_NM, -KPI_NM ) %>% 
      mutate( Variable = map2_chr(Variable, KPI_NM, ~str_replace(.x,"VAL",.y)) ) %>%
      select( -KPI_NM ) %>% spread( Variable, Value )
    # # A tibble: 4 x 9
    #    ORG_NM DENO_xxx_mean DENO_xxx_sum DENO_yyy_mean DENO_yyy_sum NUMR_xxx_mean
    #    <chr>          <dbl>        <dbl>         <dbl>        <dbl>         <dbl>
    #  1 AAA               10           20             3            3             9
    #  2 BBB                7            7             1            1             1
    #  3 CCC                3            3             4            4             9
    #  4 DDD                7            7            NA           NA             1
    #  # … with 3 more variables: NUMR_xxx_sum <dbl>, NUMR_yyy_mean <dbl>,
    #  #   NUMR_yyy_sum <dbl>