Search code examples
rdplyrgroup-bysummarize

Nest multiple dplyr::summarise with different grouping variables


I have a data frame with 100 records including bmi class (over or below 30), waist circumference class (over or below threshold) and outcome variable (deceased 0 or 1).

set.seed(1)
data <- 
tibble(bmiclass=sample(x=c(0,1), size=100, replace = TRUE),
       wcclass=sample(x=c(0,1), size=100, replace = TRUE),
       deceased=sample(x=c(0,1), size=100, replace = TRUE))

I'd need to get two information in the same table: 1) percentage of subjects in the higher WC class by BMI group and 2) the risk of death by BMI group and WC class. I managed to do this by joining two dplyr::group_by and dplyr::summarise by left_join function as follows:

data %>% group_by(bmiclass, wcclass) %>% dplyr::summarise(risk.death=sum(deceased)/n()*100) %>% 
  left_join(data %>% group_by(bmiclass) %>% dplyr::summarise(risk.wc=sum(wcclass)/n()*100), by="bmiclass")

BUT i'm wondering if there is a more straightforward way to do it simpler without left_join?


Solution

  • This will equivalently do the same thing

    data %>% 
      group_by(bmiclass) %>%
      mutate(risk.wc = sum(wcclass)/n()*100) %>%
      group_by(bmiclass, wcclass, risk.wc) %>% summarise(risk.death=sum(deceased)/n()*100)
    
    # A tibble: 4 x 4
    # Groups:   bmiclass, wcclass [4]
      bmiclass wcclass risk.wc risk.death
         <dbl>   <dbl>   <dbl>      <dbl>
    1        0       0    49.0       52  
    2        0       1    49.0       50  
    3        1       0    45.1       64.3
    4        1       1    45.1       56.5
    

    Check it with your code

    > data %>% group_by(bmiclass, wcclass) %>% dplyr::summarise(risk.death=sum(deceased)/n()*100) %>% 
    +   left_join(data %>% group_by(bmiclass) %>% dplyr::summarise(risk.wc=sum(wcclass)/n()*100), by="bmiclass")
    `summarise()` has grouped output by 'bmiclass'. You can override using the `.groups` argument.
    # A tibble: 4 x 4
    # Groups:   bmiclass [2]
      bmiclass wcclass risk.death risk.wc
         <dbl>   <dbl>      <dbl>   <dbl>
    1        0       0       52      49.0
    2        0       1       50      49.0
    3        1       0       64.3    45.1
    4        1       1       56.5    45.1