r loops dplyr dynamic-variables

What is the simplest way to compute the average of one variable grouped by a second variable, iterating over all possible grouping variables, with dplyr?


I have a data frame with a large number of variables. One of them, the probability of death (PoD), is to be predicted from all the others. As a preliminary step, I want to estimate the PoD by computing the death rate within bins of each variable.

Let's say df <- data.frame(age = c(25, 57, 60), weight = c(80, 92, 61), cigarettes_a_day = c(30, 2, 19), death_flag = c(1, 0, 1))

Then I can group by age (say, under 50 and over 50) and compute the PoD of each group as the number of death_flags divided by the number of people in the group, or simply the average of death_flag. When grouping by weight (say, below and above 80) I obtain a different death rate, and thus a different PoD, for each binned variable, which is what I want. My problem arises when I try to iterate through all variables.
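
For the age example, the computation for a single variable could look like the following sketch. The cut-off at 50, the bin labels and the age_bin column name are illustrative assumptions only; cut() is called with right = FALSE so that an age of exactly 50 would fall into the upper bin.

```r
library(dplyr)

df <- data.frame(
  age = c(25, 57, 60),
  weight = c(80, 92, 61),
  cigarettes_a_day = c(30, 2, 19),
  death_flag = c(1, 0, 1)
)

# Bin age into two groups, then take the mean of death_flag per bin.
# The mean of a 0/1 flag equals the share of deaths in that bin.
pod_age <- df %>%
  mutate(age_bin = cut(age, breaks = c(-Inf, 50, Inf),
                       labels = c("under 50", "50 and over"),
                       right = FALSE)) %>%
  group_by(age_bin) %>%
  summarise(PoD_bin = mean(death_flag))

pod_age
```

With the three example rows, the "under 50" bin holds one person who died (PoD 1) and the "50 and over" bin holds two people, one of whom died (PoD 0.5).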

So far I've tried variants of the following code, which does not work:

for(n in names(df)) {

    df%>% group_by(n)%>%
      summarise(PoD_bin = mean(death_flag))
}

I haven't figured out a way to run through all variables and perform the computation.

As a side note, I have done the binning of the variables without dplyr:

for(v in names(df[-1])){
    newVar <- paste(v, "bin", sep = "_")
    df[newVar] <- cut(as.matrix(df[v]), breaks = 100)
}

It irritates me that I cannot refer to the variables in the first for loop for the grouping, while I can do so in the second to create new columns of df.

Help is greatly appreciated!


Solution

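  • Your loop doesn't work because a character string is passed to group_by(), which expects a column name (a symbol) rather than a string. You can modify your loop slightly to get the desired result: sym() converts the string to a symbol and !! unquotes it. I have added print() because results are not printed automatically inside a for loop.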

    library(dplyr)

    for (n in names(df)) {
      # sym() turns the string n into a symbol; !! unquotes it for group_by()
      df |>
        group_by(!!sym(n)) |>
        summarise(PoD_bin = mean(death_flag)) |>
        print()
    }
    

    Output:

    # A tibble: 3 × 2
        age PoD_bin
      <dbl>   <dbl>
    1    25       1
    2    57       0
    3    60       1
    # A tibble: 3 × 2
      weight PoD_bin
       <dbl>   <dbl>
    1     61       1
    2     80       1
    3     92       0
    # A tibble: 3 × 2
      cigarettes_a_day PoD_bin
                 <dbl>   <dbl>
    1                2       0
    2               19       1
    3               30       1
    # A tibble: 2 × 2
      death_flag PoD_bin
           <dbl>   <dbl>
    1          0       0
    2          1       1
    

    Data:

    df <- tibble(age = c(25, 57, 60), weight = c(80, 92, 61), cigarettes_a_day = c(30, 2, 19), death_flag=c(1,0,1))
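
If the summaries are needed later rather than only printed, the same idea can collect them in a named list. This is a minimal sketch, not part of the loop above: it assumes death_flag should be skipped as a grouping variable, and it selects the column with across(all_of(n)) instead of !!sym(n). The names predictors and pod_by_var are made up for this example.

```r
library(dplyr)

df <- tibble(
  age = c(25, 57, 60),
  weight = c(80, 92, 61),
  cigarettes_a_day = c(30, 2, 19),
  death_flag = c(1, 0, 1)
)

# Assumption: only the predictors are used for grouping,
# since grouping death_flag by itself is not informative.
predictors <- setdiff(names(df), "death_flag")

# One summary tibble per predictor, stored in a list.
pod_by_var <- lapply(predictors, function(n) {
  df |>
    group_by(across(all_of(n))) |>
    summarise(PoD_bin = mean(death_flag), .groups = "drop")
})
names(pod_by_var) <- predictors

pod_by_var$weight
```

Each element can then be accessed by variable name, for example pod_by_var$age or pod_by_var[["cigarettes_a_day"]].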