Search code examples
rdplyrstatisticsintervals

Create an intervals' column in R using quantiles and compute an average after that


I have a dataset like that, representing contracts [invented data here due to confidentiality] :

field subgroup duration_days amount_per_day
72145 7214598 700 8095800.2
72145 7214598 541 5039689.1
72145 7214554 98 3801372.8
72145 7214554 900 2444928.7
90385 9038529 100 2861857.0
90385 9038529 284 1736789.4

I'd like to obtain 2 other columns : one concerning intervals of days, based on 33th and 66th quantiles, and another that compute the average amount per day grouped by field and subgroup.

Desired output

field subgroup days_intervals avg_amount_per_day
72145 7214598 0-33th quantile Value1
72145 7214598 33th-66th quantile Value2
72145 7214598 66th quantile - (duration of the longest contract) Value3

And so on for all the field and subgroup, with of course Value 1,2,3 that are numbers as well as the values of quantiles.

When i use the following code :

f <- data.frame(df %>%  dplyr::group_by(field, subgroup) %>% dplyr::summarize(quants = quantile(duration_days, probs = c(0.33, 0.66))))

I obtain a dataframe with 3 columns : field, subgroup, quants, with 1 row per grouping and quantile value (so 2 rows everytime).

I'm wondering if R is a good tool to do that, but i can use Python if it's easier for this.


Solution

  • You may take help of cut -

    library(dplyr)
    
    df %>%  
      group_by(field, subgroup) %>%
      group_by(quants = cut(duration_days, quantile(duration_days, c(0, .33, .66, 1)), 
                            labels = c('33%', '66%', '100%'), include.lowest = TRUE), .add = TRUE) %>%
      summarise(avg_amount_per_day = mean(amount_per_day, na.rm = TRUE), .groups = 'drop')