Search code examples
rfunctiondplyrtidyversemagrittr

R/dplyr function: Frequency table including totals of grouping variables


I am working with data from a household survey for which I want to calculate frequency tables of responses to various questions (multiple answers per respondent are possible). To speed up the analysis of the survey data I have written a few local functions while trying to stay in the tidyverse logic. I have successfully written a function that allows me to calculate frequencies while using grouping variables.

library(tibble)
library(dplyr)

my_df <- tibble(id = c(1, 1, 2, 2, 3, 4, 5, 7, 8, 8),
                country = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B"),
                region = c("ax", "ax", "ax", "ay", "ay", "bx", "bx", "by", "by", "by"),
                district = c("ax1", "ax1", "ax2", "ay1", "ay2", "bx1", "bx1", "by1", "by1", "by1"),
                question = c("answer1", "answer2", "answer1", "answer2", "answer1", "answer1", "answer1", "answer2", "answer1", "answer2"))

freq <- function(df, var, id_var, ...) {
  
  n <- df %>%
    group_by(...) %>% 
    summarise(n = NROW(unique({{id_var}})), .groups = "drop") %>% 
    left_join(distinct(df, {{var}}), by = character(), .)
  
  df %>%
    group_by(..., {{var}}) %>%
    summarise(cases = n(), .groups = "drop") %>%
    left_join(n) %>%
    mutate(freq_answer = cases/n*100) %>%
    ungroup()
}

my_df %>% 
  freq(question, id, country, region)

# A tibble: 7 x 6
  country region question cases     n freq_answer
  <chr>   <chr>  <chr>    <int> <int>       <dbl>
1 A       ax     answer1      2     2         100
2 A       ax     answer2      1     2          50
3 A       ay     answer1      1     2          50
4 A       ay     answer2      1     2          50
5 B       bx     answer1      2     2         100
6 B       by     answer1      1     2          50
7 B       by     answer2      2     2         100

For reporting purposes, in some cases I want to calculate the frequencies for each grouping level and combine the results in a single dataframe. I have found a solution that lets me do this for a fixed number of grouping variables and I am able to get the desired result. Obviously, if I use more or less grouping variables I would have to specify additional functions.

freq_sum <- function(df, var, id_var, group1, group2) {
  
  df0 <- freq({{df}}, {{var}}, {{id_var}}) %>%
    add_column({{group1}} :="Total", .before = 1) %>% 
    add_column({{group2}} :="Total", .after = 1)
  
  df1 <- freq({{df}}, {{var}}, {{id_var}}, {{group1}}) %>%
    add_column({{group2}} :="Total", .after = 1)
  
  df2 <- freq({{df}}, {{var}}, {{id_var}}, {{group1}}, {{group2}})
  
  rbind(df2, df1, df0)
  
}

my_df %>% 
  freq_sum(question, id, country, region)

   country region question cases     n freq_answer
   <chr>   <chr>  <chr>    <int> <int>       <dbl>
 1 A       ax     answer1      2     2       100  
 2 A       ax     answer2      1     2        50  
 3 A       ay     answer1      1     2        50  
 4 A       ay     answer2      1     2        50  
 5 B       bx     answer1      2     2       100  
 6 B       by     answer1      1     2        50  
 7 B       by     answer2      2     2       100  
 8 A       Total  answer1      3     3       100  
 9 A       Total  answer2      2     3        66.7
10 B       Total  answer1      3     4        75  
11 B       Total  answer2      2     4        50  
12 Total   Total  answer1      6     7        85.7
13 Total   Total  answer2      4     7        57.1

My question: Does anyone have any suggestions on how to make the freq_sum function more general/elegant which does not require the prior specification of the number of grouping variables?

I had some initial ideas on how to achieve this but am not sure how to implement them or whether they are feasible in the first place.

freq_sum <- function(df, var, id, ...) {
  
  df0 <- df %>% 
    freq({{var}}, {{id}}, ...)
  
  grouping_vars <- df0 %>% 
    select(1:{{var}}) %>%
    select(-last_col()) %>% 
    names()
  
  # From grouping_vars create a list with vectors that contain increasingy less grouping variables.
  [1] "country" "region" "district"
  [2] "country" "region"
  [3] "country"
  
  # Use the elements of the list as input in the freq() function.
  # Add the missing grouping variables to the resulting data frames.
  # Combine all dataframes in a single data frame.
  
}


Solution

  • In case anyone has a similar problem: With the help of the answers to the two questions below, I found a solution involving a for loop that works as intended and lets me freely chose the number of grouping variables to be summarised.

    R: Create empty tibble/data frame with column names coming from a vector

    Adding column if it does not exist

    freq_sum <- function(df, var, id_var, ...) {
       
      var_names <- names(select(df, ...))
      df_total <- bind_rows(setNames(rep("Total", length(var_names)), var_names))
      df_final <- df %>% freq({{var}}, {{id_var}}, ...)
      
      for (i in 1:length(var_names)-1) {
        
        v <- var_names[1:i]
        
        df_final <- df %>%
          freq({{var}}, {{id_var}}, across(v)) %>% 
          add_column(!!!df_total[!names(df_total) %in% names(.)]) %>% 
          rbind(df_final,.) %>% 
          distinct()
      }
      
      df %>%
        freq({{var}}, {{id_var}}) %>%
        add_column(!!!df_total[!names(df_total) %in% names(.)]) %>%
        rbind(df_final, .)
    }
    
    my_df %>% 
      freq_sum(question, id, country, region)
    
    # A tibble: 13 x 6
       country region question cases     n freq_answer
       <chr>   <chr>  <chr>    <int> <int>       <dbl>
     1 A       ax     answer1      2     2       100  
     2 A       ax     answer2      1     2        50  
     3 A       ay     answer1      1     2        50  
     4 A       ay     answer2      1     2        50  
     5 B       bx     answer1      2     2       100  
     6 B       by     answer1      1     2        50  
     7 B       by     answer2      2     2       100  
     8 A       Total  answer1      3     3       100  
     9 A       Total  answer2      2     3        66.7
    10 B       Total  answer1      3     4        75  
    11 B       Total  answer2      2     4        50  
    12 Total   Total  answer1      6     7        85.7
    13 Total   Total  answer2      4     7        57.1