I am using the following code to get counts and percentages of the proportions of the different levels of a categorical variable.
table(df$var1)
table(df$var1)%>%
prop.table()
The issue is that I have more than 50 categorical variables in my dataset and manually doing this for each one is too time-consuming. Also, I have to note it down in my report on a table. I was wondering if it was possible to include different variables in a code and then the output is a table with multiple categorical variables with counts and percentages. Something like the attached picture in this question. Output table
Here's an example using dplyr
. You use select()
to choose the categorical variables, which can be done by name, column number or one of the tidy select functions (e.g., matches()
, starts_with()
, ends_with()
, contains()
). The rest just makes the counts and turns them into proportions. In the code, all you would have to change is the input data frame and the variables in the call to select()
.
library(dplyr)
library(tidyr)
data(diamonds, package="ggplot2")
diamonds %>%
select(cut:clarity) %>%
mutate(across(everything(), as.character)) %>%
pivot_longer(everything(), values_to="category", names_to="variable") %>%
group_by(variable, category) %>%
tally() %>%
group_by(variable) %>%
mutate(prop = n/sum(n))
#> # A tibble: 20 × 4
#> # Groups: variable [3]
#> variable category n prop
#> <chr> <chr> <int> <dbl>
#> 1 clarity I1 741 0.0137
#> 2 clarity IF 1790 0.0332
#> 3 clarity SI1 13065 0.242
#> 4 clarity SI2 9194 0.170
#> 5 clarity VS1 8171 0.151
#> 6 clarity VS2 12258 0.227
#> 7 clarity VVS1 3655 0.0678
#> 8 clarity VVS2 5066 0.0939
#> 9 color D 6775 0.126
#> 10 color E 9797 0.182
#> 11 color F 9542 0.177
#> 12 color G 11292 0.209
#> 13 color H 8304 0.154
#> 14 color I 5422 0.101
#> 15 color J 2808 0.0521
#> 16 cut Fair 1610 0.0298
#> 17 cut Good 4906 0.0910
#> 18 cut Ideal 21551 0.400
#> 19 cut Premium 13791 0.256
#> 20 cut Very Good 12082 0.224
Created on 2023-05-16 with reprex v2.0.2