Search code examples
rdplyr

Is there a way to automate generating a table with dplyr?


I have a dataset that I'm trying to break down by sex. I've made a table of one of the variables ("awarskin") broken down by counts of male/female responses with the following code:

df %>% 
  group_by(x, sex) %>% 
  mutate(n=n()) %>% #Create counts column
  group_by(x) %>% 
  distinct(x,sex,n) %>% 
  mutate(Per=n/sum(n), np=paste0(n," (",round(Per*100,2)," %)")) %>% #Create and define percentage column
  select(-n,-Per) %>% #Select only the counts and percentages
  spread(sex,np) #Spread sex values across columns

However, there are over 50 variables that I need to do this for, and I'd like to come up with a function that can automate the creation of a table like this, given the names of two columns in the dataset.

I attempted to solve the problem by defining a function, and the code is below.

sex_percent_table <- function(column){
ctdf %>% 
  group_by(column, sex) %>% 
  mutate(n=n()) %>% #Create counts column
  group_by(column) %>% 
  distinct(column,sex,n) %>% 
  mutate(Per=n/sum(n), np=paste0(n," (",round(Per*100,2)," %)")) %>% #Create and define percentage column
  select(-n,-Per) %>% #Select only the counts and percentages
  spread(sex,np) #Spread sex values across columns
}

This code also works fine, but when I try to actually call the function for a variable, like so sex_percent_table(markex) it throws an error saying "column 'x' is not found." The backtrace is as follows:

Error in group_by(., x, sex) : 
✖ Column `x` is not found.
14.
stop(fallback)
13.
signal_abort(cnd, .file)
12.
abort(bullets, call = error_call)
11.
group_by_prepare(.data, ..., .add = .add, caller_env = caller_env())
10.
group_by.data.frame(., x, sex)
9.
group_by(., x, sex)
8.
mutate(., n = n())
7.
group_by(., x)
6.
distinct(., x, sex, n)
5.
mutate(., Per = n/sum(n), np = paste0(n, " (", round(Per * 100,
2), " %)"))
4.
select(., -n, -Per)
3.
spread(., sex, np)
2.
ctdf %>% group_by(x, sex) %>% mutate(n = n()) %>% group_by(x) %>%
distinct(x, sex, n) %>% mutate(Per = n/sum(n), np = paste0(n,
" (", round(Per * 100, 2), " %)")) %>% select(-n, -Per) %>%
spread(sex, np)
1.
sex_percent_table(markex)

I'm still somewhat new to R and coding in general, and I would appreciate any explanation of what's going wrong and how I can correctly go about automating this.

As a reproducible example, here's the same function applied to the mtcars dataset, using the binary vs variable.

percent_table <- function(column){
mtcars %>% 
  group_by(column, vs) %>% 
  mutate(n=n()) %>% #Create counts column
  group_by(column) %>% 
  distinct(column,vs,n) %>% 
  mutate(Per=n/sum(n), np=paste0(n," (",round(Per*100,2)," %)")) %>% #Create and define percentage column
  select(-n,-Per) %>% #Select only the counts and percentages
  spread(vs,np) #Spread values across columns
}

Solution

  • You can use curly brackets {{column}} to tell dplyr that you mean the column name.

    Example with Mtcars

    This R code creates a function vs_perc_table() that generates a percentage breakdown table for the 'vs' (V/S - V-shaped or Straight) column in the mtcars dataset.

    The lapply part applies this function to every column in the dataset. do.call(rbind, results) combines all these individual column analyses into one final_table.

    library(dplyr)
    library(tidyverse)
        
    ctdf <- mtcars
    
    vs_perc_table <- function(column) {
      
      col_quo <- enquo(column)
      col_name <- quo_name(col_quo)
      
      ctdf %>% 
        group_by({{column}}, vs) %>% 
        mutate(n = n()) %>% 
        group_by({{column}}) %>% 
        distinct({{column}}, vs, n) %>% 
        mutate(Per = n/sum(n), 
               np = paste0(n, " (", round(Per*100, 2), " %)"),
               col = col_name,
               val = {{column}}) %>% 
        select(-n, -Per) %>% 
        spread(vs, np) %>% 
        ungroup() %>% 
        select(-!!col_quo) # deselect the column, because it is already denoted in col
    }
    
    vs_perc_table(gear)
    
    # Get the column symbols
    column_syms <- syms(colnames(ctdf)[colnames(ctdf) != "vs"])
    
    # Use lapply with the column symbols for a dataframe list
    results <- lapply(column_syms, function(col) vs_perc_table(!!sym(col)))
    
    # If you want to combine all results
    final_table <- do.call(rbind, results)
    

    Final Table (Head)

    col val 0 1
    cyl 4.000 1 (9.09 %) 10 (90.91 %)
    cyl 6.000 3 (42.86 %) 4 (57.14 %)
    cyl 8.000 14 (100 %) NA
    gear 3.000 12 (80 %) 3 (20 %)
    gear 4.000 2 (16.67 %) 10 (83.33 %)
    gear 5.000 4 (80 %) 1 (20 %)

    Now how to do it much simpler

    until now I just showed you how to do it with your function. However we can do the same without a function or lapply:

    final_table <- mtcars %>%
      pivot_longer(cols = -c(vs), names_to = "col", values_to = "val") %>%
      group_by(col, val, vs) %>%
      summarise(n = n()) %>%
      group_by(col, val) %>%
      mutate(np = paste0(n, " (", round(n/sum(n)*100, 2), " %)")) %>%
      select(-n) %>%
      pivot_wider(names_from = vs, values_from = np, names_prefix = "vs_")
    

    which we can adapt for your case:

    final_table <- ctdf %>%
      pivot_longer(cols = -c(sex), names_to = "col", values_to = "val") %>%
      group_by(col, val, sex) %>%
      summarise(n = n()) %>%
      group_by(col, val) %>%
      mutate(np = paste0(n, " (", round(n/sum(n)*100, 2), " %)")) %>%
      select(-n) %>%
      pivot_wider(names_from = sex, values_from = np, names_prefix = "sex_")