Group by and summarize percentage based on dichotomous variable

Hi I have this dataset here: diagnoses_2_or_more and diagnoses_3_or_more are categorical where 1 indicates yes and 0 indicates no.

id <- c(1,2,3,4,5,6,7)
grp <- c("1","1","1","2","2","3","3")
diagnosis_2_or_more <- c(1,1,0,1,0,1,0)
diagnosis_3_or_more <- c(1,0,1,1,1,0,1)

df <- data.frame(id,grp,diagnosis_2_or_more,diagnosis_3_or_more)

I want to calculate the percentage of people who have 2 or more diagnoses and who have 3 or more diagnoses for each group.

The desired dataset would look like this:

id <- c(1,2,3,4,5,6,7)
grp <- c("1","1","1","2","2","3","3")
diagnosis_2_or_more <- c(1,1,0,1,0,1,0)
diagnosis_3_or_more <- c(1,0,1,1,1,0,1)
perc_2_or_more <- c(0.67,0.67,0.67,0.5,0.5,0.5,0.5)
perc_3_or_more <- c(0.67,0.67,0.67,0.5,1,0.5,0.5)

df <- data.frame(id,grp,diagnosis_2_or_more,diagnosis_3_or_more,perc_2_or_more,perc_3_or_more)

For example for group 1, percentage of people who have 2 or more diagnoses would be calculated as 2/3 (2: number of people who have 2 or more diagnoses [coded as 1], total people in group 1: 3).

Is there a way to do this with group by and summarize or by any other way?

I would appreciate all the help there is! Thanks!!!

Solution

For more then 2 columns:

library(dplyr)
library(stringr)

  df %>%
    group_by(grp) %>%
    mutate(across(diagnosis_2_or_more:diagnosis_3_or_more, ~mean(.x),
                  .names = "perc_{str_replace(.col, 'diagnosis', '')}"))

For 2 columns only

library(dplyr)

  df %>%
  group_by(grp) %>%
  mutate(perc_2_or_more = mean(diagnosis_2_or_more),
            perc_3_or_more = mean(diagnosis_3_or_more))

     id grp   diagnosis_2_or_more diagnosis_3_or_more perc_2_or_more perc_3_or_more
  <dbl> <chr>               <dbl>               <dbl>          <dbl>          <dbl>
1     1 1                       1                   1          0.667          0.667
2     2 1                       1                   0          0.667          0.667
3     3 1                       0                   1          0.667          0.667
4     4 2                       1                   1          0.5            1    
5     5 2                       0                   1          0.5            1    
6     6 3                       1                   0          0.5            0.5  
7     7 3                       0                   1          0.5            0.5

Mean is used to calcluate the percentage because we have dichotomous variables (0 or 1).

The mean of a dichotomous variable is the proportion of observations that have a value of 1. This proportion can also be interpreted as a percentage by multiplying it by 100.