I have the following two dataframes:
data <- data.frame(
yyyymm = c(202401, 202401,202401,202401,202402,202402,202402,202402),
id = c(1,2,3,4,1,2,3,4),
value = c(1,200,400,99999,2,400,200,180),
quintile = c(1,3,4,5,1,4,3,2) #this is the desired output
)
breakpoints <- data.frame(
yyyymm = c(202401,202402),
q1 = c(-Inf,-Inf),
q2 = c(100,105),
q3 = c(200,185),
q4 = c(250,240),
q5 = c(500,500)
)
where 'data' is panel format and contains one data point for each id in each month. 'breakpoints' contains monthly breakpoints in long format which i want to use as bin breakpoints. For example, quintile 1 in the first month is [-Inf, 100).
I would like to calculate the bin that each 'id' in each 'yyyymm' belongs to according to the breakpoints which are also set monthly. For example in '202401' and for id=2, quintile=3.
Ideally this would be done with dplyr to fit in with rest of the code and would not require case_when or similar since the real breakpoint data has 20 breakpoints.
I tried to left join the breakpoints to 'data' and pass off the breakpoints to the cut function as follows:
data %>%
left_join(breakpoints, by='yyyymm') %>%
mutate(
quintile_calculation = cut(value, breaks=c(q1,q2,q3,q4,q5)))
but i get error:
Error in `mutate()`:
ℹ In argument: `quintile_calculation = cut(value, breaks = c(q1, q2, q3, q4, q5))`.
Caused by error in `cut.default()`:
! 'breaks' are not unique
Your problem is that after you join the data, it looks like this:
data %>%
left_join(breakpoints, by='yyyymm')
# yyyymm id value quintile q1 q2 q3 q4 q5
# 1 202401 1 1 1 -Inf 100 200 250 500
# 2 202401 2 200 3 -Inf 100 200 250 500
# 3 202401 3 400 4 -Inf 100 200 250 500
# 4 202401 4 99999 5 -Inf 100 200 250 500
# 5 202402 1 2 1 -Inf 105 185 240 500
# 6 202402 2 400 4 -Inf 105 185 240 500
# 7 202402 3 200 3 -Inf 105 185 240 500
# 8 202402 4 180 2 -Inf 105 185 240 500
So the values in q1
, q2
, etc. are repeated on each row, but the breaks
argument needs to be unique. A simple fix to make them unique is the function unique()
. You also want to do the cut
by group defined by yyyymm
, so you'll need to use either group_by
or (my preference) the .by
argument.
data %>%
left_join(breakpoints, by='yyyymm') %>%
mutate(
qtile = cut(value, breaks = unique(c(q1,q2,q3,q4,q5))),
.by = yyyymm
)
# yyyymm id value quintile q1 q2 q3 q4 q5 qtile
# 1 202401 1 1 1 -Inf 100 200 250 500 (-Inf,100]
# 2 202401 2 200 3 -Inf 100 200 250 500 (100,200]
# 3 202401 3 400 4 -Inf 100 200 250 500 (250,500]
# 4 202401 4 99999 5 -Inf 100 200 250 500 <NA>
# 5 202402 1 2 1 -Inf 105 185 240 500 (-Inf,105]
# 6 202402 2 400 4 -Inf 105 185 240 500 (240,500]
# 7 202402 3 200 3 -Inf 105 185 240 500 (185,240]
# 8 202402 4 180 2 -Inf 105 185 240 500 (105,185]
To get the integer output you want from cut
, you'll also want to pass labels = FALSE
to cut
, and perhaps you want right = TRUE
, and n
breaks defines n - 1
quantiles, so if you want 5 groups you'll need 6 breaks.
You may find findInterval
easier to use to get what you want, the defaults align with what you've already put together:
data %>%
left_join(breakpoints, by='yyyymm') %>%
mutate(
qtile = findInterval(value, vec = unique(c(q1,q2,q3,q4,q5))),
.by = yyyymm
) |>
select(-matches("^q\\d+$"))
# yyyymm id value quintile qtile
# 1 202401 1 1 1 1
# 2 202401 2 200 3 3
# 3 202401 3 400 4 4
# 4 202401 4 99999 5 5
# 5 202402 1 2 1 1
# 6 202402 2 400 4 4
# 7 202402 3 200 3 3
# 8 202402 4 180 2 2