Search code examples
rcutbins

Cut() with custom bins


I have the following two dataframes:

data <- data.frame(
  yyyymm = c(202401, 202401,202401,202401,202402,202402,202402,202402),
  id = c(1,2,3,4,1,2,3,4),
  value = c(1,200,400,99999,2,400,200,180),
  quintile = c(1,3,4,5,1,4,3,2) #this is the desired output
)

breakpoints <- data.frame(
  yyyymm = c(202401,202402),
  q1 = c(-Inf,-Inf),
  q2 = c(100,105),
  q3 = c(200,185),
  q4 = c(250,240),
  q5 = c(500,500)
)

where 'data' is panel format and contains one data point for each id in each month. 'breakpoints' contains monthly breakpoints in long format which i want to use as bin breakpoints. For example, quintile 1 in the first month is [-Inf, 100).

I would like to calculate the bin that each 'id' in each 'yyyymm' belongs to according to the breakpoints which are also set monthly. For example in '202401' and for id=2, quintile=3.

Ideally this would be done with dplyr to fit in with rest of the code and would not require case_when or similar since the real breakpoint data has 20 breakpoints.

I tried to left join the breakpoints to 'data' and pass off the breakpoints to the cut function as follows:

data %>%
  left_join(breakpoints, by='yyyymm') %>%
  mutate(
    quintile_calculation = cut(value, breaks=c(q1,q2,q3,q4,q5)))

but i get error:

Error in `mutate()`:
ℹ In argument: `quintile_calculation = cut(value, breaks = c(q1, q2, q3, q4, q5))`.
Caused by error in `cut.default()`:
! 'breaks' are not unique

Solution

  • Your problem is that after you join the data, it looks like this:

    data %>%
       left_join(breakpoints, by='yyyymm')
    #   yyyymm id value quintile   q1  q2  q3  q4  q5
    # 1 202401  1     1        1 -Inf 100 200 250 500
    # 2 202401  2   200        3 -Inf 100 200 250 500
    # 3 202401  3   400        4 -Inf 100 200 250 500
    # 4 202401  4 99999        5 -Inf 100 200 250 500
    # 5 202402  1     2        1 -Inf 105 185 240 500
    # 6 202402  2   400        4 -Inf 105 185 240 500
    # 7 202402  3   200        3 -Inf 105 185 240 500
    # 8 202402  4   180        2 -Inf 105 185 240 500
    

    So the values in q1, q2, etc. are repeated on each row, but the breaks argument needs to be unique. A simple fix to make them unique is the function unique(). You also want to do the cut by group defined by yyyymm, so you'll need to use either group_by or (my preference) the .by argument.

    data %>%
      left_join(breakpoints, by='yyyymm') %>%
      mutate(
        qtile = cut(value, breaks = unique(c(q1,q2,q3,q4,q5))),
        .by = yyyymm
      )
    #   yyyymm id value quintile   q1  q2  q3  q4  q5      qtile
    # 1 202401  1     1        1 -Inf 100 200 250 500 (-Inf,100]
    # 2 202401  2   200        3 -Inf 100 200 250 500  (100,200]
    # 3 202401  3   400        4 -Inf 100 200 250 500  (250,500]
    # 4 202401  4 99999        5 -Inf 100 200 250 500       <NA>
    # 5 202402  1     2        1 -Inf 105 185 240 500 (-Inf,105]
    # 6 202402  2   400        4 -Inf 105 185 240 500  (240,500]
    # 7 202402  3   200        3 -Inf 105 185 240 500  (185,240]
    # 8 202402  4   180        2 -Inf 105 185 240 500  (105,185]
    

    To get the integer output you want from cut, you'll also want to pass labels = FALSE to cut, and perhaps you want right = TRUE, and n breaks defines n - 1 quantiles, so if you want 5 groups you'll need 6 breaks.

    You may find findInterval easier to use to get what you want, the defaults align with what you've already put together:

    data %>%
      left_join(breakpoints, by='yyyymm') %>%
      mutate(
        qtile = findInterval(value, vec = unique(c(q1,q2,q3,q4,q5))),
        .by = yyyymm
      ) |>
      select(-matches("^q\\d+$"))
    #   yyyymm id value quintile qtile
    # 1 202401  1     1        1     1
    # 2 202401  2   200        3     3
    # 3 202401  3   400        4     4
    # 4 202401  4 99999        5     5
    # 5 202402  1     2        1     1
    # 6 202402  2   400        4     4
    # 7 202402  3   200        3     3
    # 8 202402  4   180        2     2