I have a dataset like that, representing contracts [invented data here due to confidentiality] :
field | subgroup | duration_days | amount_per_day |
---|---|---|---|
72145 | 7214598 | 700 | 8095800.2 |
72145 | 7214598 | 541 | 5039689.1 |
72145 | 7214554 | 98 | 3801372.8 |
72145 | 7214554 | 900 | 2444928.7 |
90385 | 9038529 | 100 | 2861857.0 |
90385 | 9038529 | 284 | 1736789.4 |
I'd like to obtain 2 other columns : one concerning intervals of days, based on 33th and 66th quantiles, and another that compute the average amount per day grouped by field and subgroup.
Desired output
field | subgroup | days_intervals | avg_amount_per_day |
---|---|---|---|
72145 | 7214598 | 0-33th quantile | Value1 |
72145 | 7214598 | 33th-66th quantile | Value2 |
72145 | 7214598 | 66th quantile - (duration of the longest contract) | Value3 |
And so on for all the field and subgroup, with of course Value 1,2,3 that are numbers as well as the values of quantiles.
When i use the following code :
f <- data.frame(df %>% dplyr::group_by(field, subgroup) %>% dplyr::summarize(quants = quantile(duration_days, probs = c(0.33, 0.66))))
I obtain a dataframe with 3 columns : field, subgroup, quants, with 1 row per grouping and quantile value (so 2 rows everytime).
I'm wondering if R is a good tool to do that, but i can use Python if it's easier for this.
You may take help of cut
-
library(dplyr)
df %>%
group_by(field, subgroup) %>%
group_by(quants = cut(duration_days, quantile(duration_days, c(0, .33, .66, 1)),
labels = c('33%', '66%', '100%'), include.lowest = TRUE), .add = TRUE) %>%
summarise(avg_amount_per_day = mean(amount_per_day, na.rm = TRUE), .groups = 'drop')