
Why does my dplyr percentile calculation not work with tidy evaluation?


I have a tibble with student test data, and I wish to convert the scores to percentiles using dplyr. As a minimal example, imagine the following setup with three students.

library(tidyverse)

tbl <- tibble(Name = c("Alice", "Bob", "Cat"), Test = c(16, 13, 15))

The following code works and yields the desired output.

tbl %>% mutate(TestPercentile = cume_dist(Test) * 100)

# A tibble: 3 x 3
  Name   Test TestPercentile
  <chr> <dbl>          <dbl>
1 Alice    16          100  
2 Bob      13           33.3
3 Cat      15           66.7

However, I actually want to do it programmatically because there are many such columns.

colname <- "Test"
percname <- str_c(colname, "Percentile")
tbl %>% mutate({{percname}} := cume_dist({{colname}}) * 100)

# A tibble: 3 x 3
  Name   Test TestPercentile
  <chr> <dbl>          <dbl>
1 Alice    16            100
2 Bob      13            100
3 Cat      15            100

Why does cume_dist make the percentile 100 for all students when I try to use tidy evaluation like this? (And ideally, if I can be permitted a second question, how can I fix it?)


Solution

  • If by programmatically you mean you want to write your own function, you can do it like this:

    calculate_percentile <- function(data, colname) {
      # {{colname}} forwards the unquoted column name into mutate();
      # the glue-style string on the left builds the new column's name.
      data %>%
        mutate("{{colname}}Percentile" := cume_dist({{colname}}) * 100)
    }
    
    tbl %>% 
      calculate_percentile(Test)
    
    # A tibble: 3 x 3
      Name   Test TestPercentile
      <chr> <dbl>          <dbl>
    1 Alice    16          100  
    2 Bob      13           33.3
    3 Cat      15           66.7
    

    Edit: for multiple columns, using new data

    tbl <- tibble(Name = c("Alice", "Bob", "Cat"), Test = c(16, 13, 15), Test_math = c(16, 30, 55), Test_music = c(3, 78, 34))
    
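    calculate_percentile <- function(data, colnames) {
      data %>%
        # Apply the percentile formula to every selected column and
        # name each result "<column>Percentile".
        mutate(across({{colnames}}, ~cume_dist(.) * 100, .names = "{col}Percentile"))
    }
    
    test_columns <- c("Test_math", "Test_music")
    tbl %>% 
      # all_of() marks test_columns as a character vector of column names
      calculate_percentile(all_of(test_columns))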
    
    # A tibble: 3 x 6
      Name   Test Test_math Test_music Test_mathPercentile Test_musicPercentile
      <chr> <dbl>     <dbl>      <dbl>               <dbl>                <dbl>
    1 Alice    16        16          3                33.3                 33.3
    2 Bob      13        30         78                66.7                100  
    3 Cat      15        55         34               100                   66.7
    

    Why does your solution not work? Because at the top level, outside a function, {{colname}} simply inserts the value of colname, so cume_dist is applied to the literal string "Test":

    tbl %>% mutate({{percname}} := print({{colname}}))
    
    [1] "Test"
    # A tibble: 3 x 5
      Name   Test Test_math Test_music TestPercentile
      <chr> <dbl>     <dbl>      <dbl> <chr>         
    1 Alice    16        16          3 Test          
    2 Bob      13        30         78 Test          
    3 Cat      15        55         34 Test 
    

    Why does this give a TestPercentile value of 100? Because cume_dist of the one-element vector "Test" is 1, which becomes 100 after the multiplication:

    cume_dist("Test")
    #[1] 1
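
    The missing step is that mutate recycles a length-one result to every row, which is why all three students end up with the same value. A minimal sketch of that, assuming the original three-student tibble from the question (the names single and recycled are only illustrative):

    ```r
    library(dplyr)

    # The original three-student data from the question.
    tbl <- tibble(Name = c("Alice", "Bob", "Cat"), Test = c(16, 13, 15))

    # cume_dist() on a length-one vector returns a single value.
    single <- cume_dist("Test")

    # mutate() recycles that single value to all rows of the tibble.
    recycled <- tbl %>% mutate(TestPercentile = cume_dist("Test") * 100)

    print(single)
    print(recycled$TestPercentile)
    ```

    The string never touches the Test column, so the scores themselves play no part in the result.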
    

    So we need to tell R not to evaluate the string "Test" itself but to look up the column with that name, which we can do like this:

    tbl %>% mutate({{percname}} := cume_dist(!!parse_quo(colname, env = global_env())) * 100)
    
    # A tibble: 3 x 5
      Name   Test Test_math Test_music TestPercentile
      <chr> <dbl>     <dbl>      <dbl>          <dbl>
    1 Alice    16        16          3          100  
    2 Bob      13        30         78           33.3
    3 Cat      15        55         34           66.7
    
    # Check that this uses the values of the Test column and not the string "Test":
    tbl %>% mutate({{percname}} := print(!!parse_quo(colname, env = global_env())))
    
    [1] 16 13 15
    # A tibble: 3 x 5
      Name   Test Test_math Test_music TestPercentile
      <chr> <dbl>     <dbl>      <dbl>          <dbl>
    1 Alice    16        16          3             16
    2 Bob      13        30         78             13
    3 Cat      15        55         34             15
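
    As a possible alternative to parse_quo, and only as a sketch rather than the one true fix: when the column name is already stored as a string, the .data pronoun can look the column up by name, and a glue-style string with single braces can build the new column name from the same variable. The example below assumes the original three-column-free tibble from the question; result is just an illustrative name.

    ```r
    library(dplyr)

    # The original three-student data from the question.
    tbl <- tibble(Name = c("Alice", "Bob", "Cat"), Test = c(16, 13, 15))

    colname <- "Test"

    result <- tbl %>%
      # .data[[colname]] fetches the column whose name is stored in colname;
      # "{colname}Percentile" interpolates that string into the new name.
      mutate("{colname}Percentile" := cume_dist(.data[[colname]]) * 100)

    print(result)
    ```

    This avoids parsing text into code and does not depend on the global environment, so it should behave the same inside a function that receives the column name as a string.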