Search code examples
rdata-manipulation

R: Calculating Proportions Based on Nested Groups


I am working with the R programming language.

I have the following dataset:

set.seed(123)
library(dplyr)

Patient_ID = 1:5000
gender <- c("Male","Female")
gender <- sample(gender, 5000, replace=TRUE, prob=c(0.45, 0.55))
Gender <- as.factor(gender)


status <- c("Immigrant","Citizen")
status <- sample(status, 5000, replace=TRUE, prob=c(0.3, 0.7))
Status  <- as.factor(status )

Height = rnorm(5000, 150, 10)
Weight = rnorm(5000, 90, 10)
Hospital_Visits = sample.int(20,  5000, replace = TRUE)

################

disease <- c("Yes","No")
disease <- sample(disease, 5000, replace=TRUE, prob=c(0.4, 0.6))
Disease <- as.factor(disease)

###################
my_data = data.frame(Patient_ID, Gender, Status, Height, Weight, Hospital_Visits, Disease)

  Patient_ID Gender    Status   Height    Weight Hospital_Visits Disease
1          1 Female   Citizen 145.0583 113.70725               1      No
2          2   Male Immigrant 161.2759  88.33188              18      No
3          3 Female Immigrant 138.5305  99.26961               6     Yes
4          4   Male   Citizen 164.8102  84.31848              12      No
5          5   Male   Citizen 159.1619  92.25090              12     Yes
6          6 Female   Citizen 153.3513 101.31986              11     Yes

I am trying to accomplish the following task:

  • Select all rows where Gender = Male AND Status = Citizen
  • Based on this condition, split the "Height" variable into 3 "ntiles"
  • Now, based on these (nested) cumulative conditions (i.e. Gender = Male & Status = Citizen & Height_ntile = 1 , Gender = Male & Status = Citizen & Height_ntile = 2, Gender = Male & Status = Citizen & Height_ntile = 3), split the "Weight" variable into 3 "ntiles"
  • Now, based on these added cumulative conditions (i.e. Gender = Male & Status = Citizen & Height_ntile = 1, Weight_ntile = 1,etc.), split the "Hospital_Visits" into 3 "ntiles"
  • Finally, based on these added cumulative conditions (ender = Male & Status = Citizen & Height_ntile = 1, Weight_ntile = 1, Hosptial_Visits_ntile = 1, etc.), calculate the percentage of patients who have the disease
  • Repeat for Gender = Female AND Status = Citizen, etc.

Based on a previous question (R: Interchanging "Quantile" and "Ntile" Functions?), I tried to adapt the code for this question:

    results <- my_data %>%
  group_by(Gender, Status) %>%
  mutate(group = ntile(Height, 3)) %>%
  group_by(Gender, Status, group) %>%
  mutate(min = min(Height),
         max = max(Height)) %>%
  mutate(range_Height = paste(min, max, sep = "-")) %>%
  ungroup() %>%
 group_by(Gender, Status, range_Height) %>%
  mutate(group = ntile(Weight, 3)) %>%
  group_by(Gender, Status, range_Height, group) %>%
  mutate(min = min(Weight),
         max = max(Weight)) %>%
  mutate(range_Weight = paste(min, max, sep = "-")) %>%
  ungroup() %>%
 group_by(Gender, Status, range_Height, range_Weight) %>%
  mutate(group = ntile(Hospital_Visits, 3)) %>%
  group_by(Gender, Status, range_Height, range_Weight, group) %>%
  mutate(min = min(Hospital_Visits),
         max = max(Hospital_Visits)) %>%
  mutate(range_Hospital_Visits = paste(min, max, sep = "-")) %>%
  ungroup() %>% group_by (Gender, Status, range_Height, range_Weight, range_Hospital_Visits) %>% summarise(disease_rate = mean(Disease), count = n())

The code seems to have run - but with several warnings and the "disease_rate" column being filled with NA's:

There were 50 or more warnings (use warnings() to see the first 50)
> results
# A tibble: 108 x 7
# Groups:   Gender, Status, range_Height, range_Weight [36]
   Gender Status  range_Height                      range_Weight                      range_Hospita~1 disea~2 count
   <fct>  <fct>   <chr>                             <chr>                             <chr>             <dbl> <int>
 1 Female Citizen 115.862119946011-145.511095001117 58.9918455266656-85.4984683588085 1-7                  NA    72
 2 Female Citizen 115.862119946011-145.511095001117 58.9918455266656-85.4984683588085 14-20                NA    71
 3 Female Citizen 115.862119946011-145.511095001117 58.9918455266656-85.4984683588085 8-14                 NA    72
 4 Female Citizen 115.862119946011-145.511095001117 85.5651200062572-94.2555060657397 1-8                  NA    72
 5 Female Citizen 115.862119946011-145.511095001117 85.5651200062572-94.2555060657397 14-20                NA    71
 6 Female Citizen 115.862119946011-145.511095001117 85.5651200062572-94.2555060657397 8-14                 NA    72
 7 Female Citizen 115.862119946011-145.511095001117 94.342739710942-121.024202204596  1-6                  NA    72
 8 Female Citizen 115.862119946011-145.511095001117 94.342739710942-121.024202204596  14-20                NA    71
 9 Female Citizen 115.862119946011-145.511095001117 94.342739710942-121.024202204596  7-14                 NA    71
10 Female Citizen 145.52366174493-153.896976138595  60.0595633761862-86.5042722953509 1-8                  NA    72
# ... with 98 more rows, and abbreviated variable names 1: range_Hospital_Visits, 2: disease_rate
# i Use `print(n = ...)` to see more rows

Can someone please show me how to fix this problem?

Thanks!

Note: This should NOT be the same as grouping by a variable and calculating ntiles - then interrupt the process, group by previous variable and previous ntiles ... and then calculate new ntiles. I would like the grouping and ntile calculations to be uninterrupted.


Solution

  • OP's answer is close, but not right. group_by needs .add = TRUE when you don't want to override.

    my_data %>% 
      group_by(Gender, Status) %>%
      mutate(Height_ntile = ntile(Height, 3)) %>%
      group_by(Height_ntile, .add = TRUE) %>%
      mutate(Weight_ntile = ntile(Weight, 3)) %>%
      group_by(Weight_ntile, .add = TRUE) %>%
      mutate(Hospital_Visits_ntile = ntile(Hospital_Visits, 3)) %>%
      group_by(Hospital_Visits_ntile, .add = TRUE) %>%
      summarize(percent_disease = mean(Disease == "Yes"), 
                count = n(),
                .groups = "drop")
    

    [addition]
    The default value of .add is FALSE.
    It means when you use group_by twice, the group made by 1st group_by was dropped and new group is made (-> override).
    Below is an example:

    my_data %>% 
      group_by(Gender, Status) %>%
      mutate(Height_ntile = ntile(Height, 3)) %>%
      group_by(Height_ntile)  # the default value of .add is FALSE
    
    # A tibble: 5,000 × 8
    # Groups:   Height_ntile [3]
    # ...
    
    my_data %>% 
      group_by(Gender, Status) %>%
      mutate(Height_ntile = ntile(Height, 3)) %>%
      group_by(Height_ntile, .add = TRUE)
    
    # A tibble: 5,000 × 8
    # Groups:   Gender, Status, Height_ntile [12]
    

    I couldn't come up with good ideas about range, so I did it by the simple but long code:

    my_data %>% 
      group_by(Gender, Status) %>%
      mutate(Height_ntile = ntile(Height, 3),
             Height_range = paste(min(Height), max(Height), sep = "-")) %>%
      group_by(Height_ntile, Height_range, .add = TRUE) %>%
      mutate(Weight_ntile = ntile(Weight, 3),
             Weight_range = paste(min(Weight), max(Weight), sep = "-")) %>%
      group_by(Weight_ntile, Weight_range, .add = TRUE) %>%
      mutate(Hospital_Visits_ntile = ntile(Hospital_Visits, 3),
             Hospital_range = paste(min(Hospital_Visits), max(Hospital_Visits), sep = "-")) %>%
      group_by(Hospital_Visits_ntile, Hospital_range, .add = TRUE) %>%
      summarize(percent_disease = mean(Disease == "Yes"), 
                count = n(),
                .groups = "drop")