Hi I have this dataset here: diagnoses_2_or_more and diagnoses_3_or_more are categorical where 1 indicates yes and 0 indicates no.
id <- c(1,2,3,4,5,6,7)
grp <- c("1","1","1","2","2","3","3")
diagnosis_2_or_more <- c(1,1,0,1,0,1,0)
diagnosis_3_or_more <- c(1,0,1,1,1,0,1)
df <- data.frame(id,grp,diagnosis_2_or_more,diagnosis_3_or_more)
I want to calculate the percentage of people who have 2 or more diagnoses and who have 3 or more diagnoses for each group.
The desired dataset would look like this:
id <- c(1,2,3,4,5,6,7)
grp <- c("1","1","1","2","2","3","3")
diagnosis_2_or_more <- c(1,1,0,1,0,1,0)
diagnosis_3_or_more <- c(1,0,1,1,1,0,1)
perc_2_or_more <- c(0.67,0.67,0.67,0.5,0.5,0.5,0.5)
perc_3_or_more <- c(0.67,0.67,0.67,0.5,1,0.5,0.5)
df <- data.frame(id,grp,diagnosis_2_or_more,diagnosis_3_or_more,perc_2_or_more,perc_3_or_more)
For example for group 1, percentage of people who have 2 or more diagnoses would be calculated as 2/3 (2: number of people who have 2 or more diagnoses [coded as 1], total people in group 1: 3).
Is there a way to do this with group by and summarize or by any other way?
I would appreciate all the help there is! Thanks!!!
For more then 2 columns:
library(dplyr)
library(stringr)
df %>%
group_by(grp) %>%
mutate(across(diagnosis_2_or_more:diagnosis_3_or_more, ~mean(.x),
.names = "perc_{str_replace(.col, 'diagnosis', '')}"))
For 2 columns only
library(dplyr)
df %>%
group_by(grp) %>%
mutate(perc_2_or_more = mean(diagnosis_2_or_more),
perc_3_or_more = mean(diagnosis_3_or_more))
id grp diagnosis_2_or_more diagnosis_3_or_more perc_2_or_more perc_3_or_more
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 1 0.667 0.667
2 2 1 1 0 0.667 0.667
3 3 1 0 1 0.667 0.667
4 4 2 1 1 0.5 1
5 5 2 0 1 0.5 1
6 6 3 1 0 0.5 0.5
7 7 3 0 1 0.5 0.5
Mean is used to calcluate the percentage because we have dichotomous variables (0 or 1).
The mean of a dichotomous variable is the proportion of observations that have a value of 1. This proportion can also be interpreted as a percentage by multiplying it by 100.