r, group-by, summarize

Group by and summarize percentage based on dichotomous variable


Hi, I have this dataset. diagnosis_2_or_more and diagnosis_3_or_more are dichotomous variables where 1 indicates yes and 0 indicates no.

id <- c(1,2,3,4,5,6,7)
grp <- c("1","1","1","2","2","3","3")
diagnosis_2_or_more <- c(1,1,0,1,0,1,0)
diagnosis_3_or_more <- c(1,0,1,1,1,0,1)

df <- data.frame(id,grp,diagnosis_2_or_more,diagnosis_3_or_more)

I want to calculate the percentage of people who have 2 or more diagnoses and who have 3 or more diagnoses for each group.

The desired dataset would look like this:

id <- c(1,2,3,4,5,6,7)
grp <- c("1","1","1","2","2","3","3")
diagnosis_2_or_more <- c(1,1,0,1,0,1,0)
diagnosis_3_or_more <- c(1,0,1,1,1,0,1)
perc_2_or_more <- c(0.67,0.67,0.67,0.5,0.5,0.5,0.5)
perc_3_or_more <- c(0.67,0.67,0.67,1,1,0.5,0.5)

df <- data.frame(id,grp,diagnosis_2_or_more,diagnosis_3_or_more,perc_2_or_more,perc_3_or_more)

For example, for group 1, the percentage of people who have 2 or more diagnoses would be 2/3: 2 people have 2 or more diagnoses (coded as 1) out of 3 people in the group.

Is there a way to do this with group by and summarize or by any other way?

I would appreciate all the help there is! Thanks!!!


Solution

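  • For more than 2 columns:

    library(dplyr)
    library(stringr)

    df %>%
      group_by(grp) %>%
      # the mean of a 0/1 column is the proportion of 1s within each group
      mutate(across(diagnosis_2_or_more:diagnosis_3_or_more, ~ mean(.x),
                    # drop the "diagnosis_" prefix so the new columns are perc_2_or_more, perc_3_or_more
                    .names = "perc_{str_replace(.col, 'diagnosis_', '')}"))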
    

  • For 2 columns only:

    library(dplyr)

    df %>%
      group_by(grp) %>%
      mutate(perc_2_or_more = mean(diagnosis_2_or_more),
             perc_3_or_more = mean(diagnosis_3_or_more))
    
         id grp   diagnosis_2_or_more diagnosis_3_or_more perc_2_or_more perc_3_or_more
      <dbl> <chr>               <dbl>               <dbl>          <dbl>          <dbl>
    1     1 1                       1                   1          0.667          0.667
    2     2 1                       1                   0          0.667          0.667
    3     3 1                       0                   1          0.667          0.667
    4     4 2                       1                   1          0.5            1    
    5     5 2                       0                   1          0.5            1    
    6     6 3                       1                   0          0.5            0.5  
    7     7 3                       0                   1          0.5            0.5  
    

    The mean is used to calculate the percentage because the variables are dichotomous (0 or 1).

    The mean of a dichotomous variable is the proportion of observations that have a value of 1. Multiply that proportion by 100 to express it as a percentage.
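
If one row per group is enough, the same idea can probably be written with summarise() instead of mutate(), which is closer to what the question asks for. This is only a sketch: the pct_ prefix, the n column, and the pct_by_grp name are made up for illustration, and the values are scaled by 100 so they read as percentages rather than proportions.

```r
library(dplyr)

# Same data as in the question
df <- data.frame(
  id = 1:7,
  grp = c("1", "1", "1", "2", "2", "3", "3"),
  diagnosis_2_or_more = c(1, 1, 0, 1, 0, 1, 0),
  diagnosis_3_or_more = c(1, 0, 1, 1, 1, 0, 1)
)

# One row per group: group size plus the percentage of 1s in each 0/1 column
# (pct_ prefix and pct_by_grp are illustrative names, not from the original answer)
pct_by_grp <- df %>%
  group_by(grp) %>%
  summarise(
    n = n(),
    across(starts_with("diagnosis_"), ~ 100 * mean(.x), .names = "pct_{.col}")
  )

pct_by_grp
```

For group 1 this gives 100 * 2/3 (about 66.7) in both percentage columns, matching the 2/3 worked out by hand in the question. If the per-row layout is still needed, the summary can be joined back onto df with left_join(df, pct_by_grp, by = "grp").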