I am trying to summarise sets of overlapping categories from a "here are 10 categories, pick 3" survey question. Each category has a value if they picked it and an NA if they didn't. Here is a dummy example with 3 categories:
P_A<-c("A","A",NA,NA)
P_B<-c(NA,"B","B",NA)
P_C<-c("C", NA, NA,NA)
grouper<-c("M", "M", "F","F")
data.df<-data.frame(P_A,P_B,P_C,grouper)
I am trying to create a table that shows total number select and percentage (count over number of people that selected at least one option), then also be able to subdivide by a covariate. Eg. the following table or two versions of the following table for M and F (but could be more than 2 categories in the grouper). Note, in my example, the 4th row is someone who selected no categories so is not part of the denominator for the percentage.
Variable | Overall count | Overall % |
---|---|---|
P_A | 2 | 66.67 |
P_B | 2 | 66.67 |
P_C | 1 | 33.33 |
I can produce the figures using colSums(!isna(select(data.df, P_A:P_C)))
and then dividing by sum(rowSums(!is.na(select(data.df, P_A:P_C)))>0)
but I'm struggling to make that more generalisable for other variables and am not sure how to neatly do it when I want separate tables by the grouper variable.
You can remove the rows with all NA
values, get the data in long format and for each column find number and percentage of non-NA values.
library(dplyr)
data.df %>%
select(-grouper) %>%
filter(Reduce(`|`, across(.fns = ~!is.na(.)))) %>%
tidyr::pivot_longer(cols = everything()) %>%
group_by(name) %>%
summarise(count = sum(!is.na(value)),
overall_perc = count/n() * 100)
# name count overall_perc
# <chr> <int> <dbl>
#1 P_A 2 66.7
#2 P_B 2 66.7
#3 P_C 1 33.3
To do this for each category of grouper
you could do :
data.df %>%
filter(Reduce(`|`, across(-grouper, ~!is.na(.)))) %>%
tidyr::pivot_longer(cols = -grouper) %>%
group_by(name, grouper) %>%
summarise(count = sum(!is.na(value)),
overall_perc = count/n() * 100)