Search code examples
rdataframedplyrtidyverse

Creating custom Quantiles within data frame?


If i have the following table:

tibble(year = c("2020", "2020", "2020","2021", "2021", "2021"),
       website  = c("facebook", "google", "youtube","facebook", "google", "youtube"), 
       method = c("laptop", "laptop", "laptop", "mobile", "mobile", "mobile"), 
       values = c(10,30,60, 90,25, 40))

How can i try to go about creating a column based on a custom q-tile of the numbers in the value column.

For example, if i have the following custom q-tile conditions:

Risky - > 50% Neither - 25-50% Safe - <25%

These are basically saying for the numbers in the values column, calculate what their ranking based on the q-tile conditions above and give them a rank value of 1,2,3 accordingly.

The final table should look like this:

tibble(year = c("2020", "2020", "2020","2021", "2021", "2021"),
       website  = c("facebook", "google", "youtube","facebook", "google", "youtube"), 
       method = c("laptop", "laptop", "laptop", "mobile", "mobile", "mobile"), 
       values = c(10,30,60, 90,25, 40), 
       rank = c(3,2,1,1,3,2))

I know that the table will have to be grouped by year and method along, so the code will look like this:

df %>% group_by(year, method) %>% mutate(rank = quantile(???))

Solution

  • You can use quantile(x, c(0.25, 0.5)) to get cutpoints and pass them into findInterval(). Note that findInterval() is similar to cut(*, labels = FALSE) but is more efficient.

    library(dplyr)
    
    df %>%
      group_by(year, method) %>%
      mutate(rank = findInterval(-values, quantile(-values, c(0.25, 0.5)), left.open = TRUE) + 1) %>%
      ungroup()
    
    # # A tibble: 6 × 5
    #   year  website  method values  rank
    #   <chr> <chr>    <chr>   <dbl> <dbl>
    # 1 2020  facebook laptop     10     3
    # 2 2020  google   laptop     30     2
    # 3 2020  youtube  laptop     60     1
    # 4 2021  facebook mobile     90     1
    # 5 2021  google   mobile     25     3
    # 6 2021  youtube  mobile     40     2
    

    If you want labels rather than ranks, use cut():

    df %>%
      group_by(year, method) %>%
      mutate(rank = cut(values, quantile(values, c(0, 0.25, 0.5, 1)),
                        c("Safe", "Neither", "Risky"), include.lowest = TRUE)) %>%
      ungroup()
    
    # # A tibble: 6 × 5
    #   year  website  method values rank   
    #   <chr> <chr>    <chr>   <dbl> <fct>  
    # 1 2020  facebook laptop     10 Safe   
    # 2 2020  google   laptop     30 Neither
    # 3 2020  youtube  laptop     60 Risky  
    # 4 2021  facebook mobile     90 Risky  
    # 5 2021  google   mobile     25 Safe   
    # 6 2021  youtube  mobile     40 Neither