Search code examples
rdplyrbinning

R: How to bin efficiently using time-varying breakpoints?


I'm working with a large data frame of 14 million rows, containing columns month, firmID, and firmSize. In a separate data frame I have monthly breakpoints (quintiles essentially) for firm size.

My goal is to add a fourth column quintile to the first data frame. In this column I would have a number from 1 to 5 corresponding to the size quintile the firmSize belongs to in that specific month.

I have the following loop that does the job but has a runtime of several hundreds of hours.

for (i in 1:length(df$month)) {
  for (j in 1:4) {
    if (df$size[i] <= breakpoints[which(df$month[i] == breakpoints$month),(j+1)]) {
      df$quintile[i] <- j
      break()
    }
    else {
      df$quintile[i] <- 5
    }
  }
}

I have quite limited knowledge of e.g. the applications of dplyr and I was wondering if anyone has an idea about how to approach this problem so that I don't have to keep my laptop running for weeks.

Thank you in advance!

Edit: Example data for the data frames: (thank you Ricardo for your suggestion!)

df

month  firmID   firmSize
201001 46603210 9738635
201001 72913210 1166077
201001 00621210 3884422
201512 75991610 2932127
201512 45383610 1241272
201512 05766520 1931038

breakpoints

month  Q1     Q2      Q3      Q4      Q5
201001 322770 1038300 2112300 4597580 28919700
201512 379340 1239800 2840630 7785700 46209140

Solution

  • I wonder if using findInterval and data.table might be worth pursuing and faster. This was adapted from this answer which I thought was very helpful.

    findInterval finds the index of one vector in another (assuming the other is non-decreasing). In this case, breakpoints columns from Q1 to Q5 forms the second vector, and the function will return the index based on the firmSize value in the first data frame.

    library(data.table)
    
    setDT(df)
    setkey(df, month)
    
    setDT(breakpoints)
    setkey(breakpoints, month)
    
    df[, quintile := findInterval(firmSize, breakpoints[.BY][, Q1:Q5]) + 1, by = month][]
    

    Output

        month   firmID firmSize quintile
    1: 201001 46603210  9738635        5
    2: 201001 72913210  1166077        3
    3: 201001   621210  3884422        4
    4: 201512 75991610  2932127        4
    5: 201512 45383610  1241272        3
    6: 201512  5766520  1931038        3