I have a dataset that I'm trying to break down by sex. I've made a table of one of the variables ("awarskin") broken down by counts of male/female responses with the following code:
df %>%
group_by(x, sex) %>%
mutate(n=n()) %>% #Create counts column
group_by(x) %>%
distinct(x,sex,n) %>%
mutate(Per=n/sum(n), np=paste0(n," (",round(Per*100,2)," %)")) %>% #Create and define percentage column
select(-n,-Per) %>% #Select only the counts and percentages
spread(sex,np) #Spread sex values across columns
However, there are over 50 variables that I need to do this for, and I'd like to come up with a function that can automate the creation of a table like this, given the names of two columns in the dataset.
I attempted to solve the problem by defining a function, and the code is below.
sex_percent_table <- function(column){
ctdf %>%
group_by(column, sex) %>%
mutate(n=n()) %>% #Create counts column
group_by(column) %>%
distinct(column,sex,n) %>%
mutate(Per=n/sum(n), np=paste0(n," (",round(Per*100,2)," %)")) %>% #Create and define percentage column
select(-n,-Per) %>% #Select only the counts and percentages
spread(sex,np) #Spread sex values across columns
}
This code also works fine, but when I try to actually call the function for a variable, like so sex_percent_table(markex)
it throws an error saying "column 'x' is not found." The backtrace is as follows:
Error in group_by(., x, sex) :
✖ Column `x` is not found.
14.
stop(fallback)
13.
signal_abort(cnd, .file)
12.
abort(bullets, call = error_call)
11.
group_by_prepare(.data, ..., .add = .add, caller_env = caller_env())
10.
group_by.data.frame(., x, sex)
9.
group_by(., x, sex)
8.
mutate(., n = n())
7.
group_by(., x)
6.
distinct(., x, sex, n)
5.
mutate(., Per = n/sum(n), np = paste0(n, " (", round(Per * 100,
2), " %)"))
4.
select(., -n, -Per)
3.
spread(., sex, np)
2.
ctdf %>% group_by(x, sex) %>% mutate(n = n()) %>% group_by(x) %>%
distinct(x, sex, n) %>% mutate(Per = n/sum(n), np = paste0(n,
" (", round(Per * 100, 2), " %)")) %>% select(-n, -Per) %>%
spread(sex, np)
1.
sex_percent_table(markex)
I'm still somewhat new to R and coding in general, and I would appreciate any explanation of what's going wrong and how I can correctly go about automating this.
As a reproducible example, here's the same function applied to the mtcars
dataset, using the binary vs
variable.
percent_table <- function(column){
mtcars %>%
group_by(column, vs) %>%
mutate(n=n()) %>% #Create counts column
group_by(column) %>%
distinct(column,vs,n) %>%
mutate(Per=n/sum(n), np=paste0(n," (",round(Per*100,2)," %)")) %>% #Create and define percentage column
select(-n,-Per) %>% #Select only the counts and percentages
spread(vs,np) #Spread values across columns
}
You can use curly brackets {{column}}
to tell dplyr that you mean the column name.
This R code creates a function vs_perc_table()
that generates a percentage breakdown table for the 'vs' (V/S - V-shaped or Straight) column in the mtcars
dataset.
The lapply
part applies this function to every column in the dataset. do.call(rbind, results)
combines all these individual column analyses into one final_table
.
library(dplyr)
library(tidyverse)
ctdf <- mtcars
vs_perc_table <- function(column) {
col_quo <- enquo(column)
col_name <- quo_name(col_quo)
ctdf %>%
group_by({{column}}, vs) %>%
mutate(n = n()) %>%
group_by({{column}}) %>%
distinct({{column}}, vs, n) %>%
mutate(Per = n/sum(n),
np = paste0(n, " (", round(Per*100, 2), " %)"),
col = col_name,
val = {{column}}) %>%
select(-n, -Per) %>%
spread(vs, np) %>%
ungroup() %>%
select(-!!col_quo) # deselect the column, because it is already denoted in col
}
vs_perc_table(gear)
# Get the column symbols
column_syms <- syms(colnames(ctdf)[colnames(ctdf) != "vs"])
# Use lapply with the column symbols for a dataframe list
results <- lapply(column_syms, function(col) vs_perc_table(!!sym(col)))
# If you want to combine all results
final_table <- do.call(rbind, results)
Final Table (Head)
col | val | 0 | 1 |
---|---|---|---|
cyl | 4.000 | 1 (9.09 %) | 10 (90.91 %) |
cyl | 6.000 | 3 (42.86 %) | 4 (57.14 %) |
cyl | 8.000 | 14 (100 %) | NA |
gear | 3.000 | 12 (80 %) | 3 (20 %) |
gear | 4.000 | 2 (16.67 %) | 10 (83.33 %) |
gear | 5.000 | 4 (80 %) | 1 (20 %) |
until now I just showed you how to do it with your function. However we can do the same without a function or lapply:
final_table <- mtcars %>%
pivot_longer(cols = -c(vs), names_to = "col", values_to = "val") %>%
group_by(col, val, vs) %>%
summarise(n = n()) %>%
group_by(col, val) %>%
mutate(np = paste0(n, " (", round(n/sum(n)*100, 2), " %)")) %>%
select(-n) %>%
pivot_wider(names_from = vs, values_from = np, names_prefix = "vs_")
which we can adapt for your case:
final_table <- ctdf %>%
pivot_longer(cols = -c(sex), names_to = "col", values_to = "val") %>%
group_by(col, val, sex) %>%
summarise(n = n()) %>%
group_by(col, val) %>%
mutate(np = paste0(n, " (", round(n/sum(n)*100, 2), " %)")) %>%
select(-n) %>%
pivot_wider(names_from = sex, values_from = np, names_prefix = "sex_")