I'm working with a large data frame of 14 million rows, containing columns month
, firmID
, and firmSize
. In a separate data frame I have monthly breakpoints (quintiles essentially) for firm size.
My goal is to add a fourth column quintile
to the first data frame. In this column I would have a number from 1 to 5 corresponding to the size quintile the firmSize
belongs to in that specific month.
I have the following loop that does the job but has a runtime of several hundreds of hours.
for (i in 1:length(df$month)) {
for (j in 1:4) {
if (df$size[i] <= breakpoints[which(df$month[i] == breakpoints$month),(j+1)]) {
df$quintile[i] <- j
break()
}
else {
df$quintile[i] <- 5
}
}
}
I have quite limited knowledge of e.g. the applications of dplyr and I was wondering if anyone has an idea about how to approach this problem so that I don't have to keep my laptop running for weeks.
Thank you in advance!
Edit: Example data for the data frames: (thank you Ricardo for your suggestion!)
df
month firmID firmSize
201001 46603210 9738635
201001 72913210 1166077
201001 00621210 3884422
201512 75991610 2932127
201512 45383610 1241272
201512 05766520 1931038
breakpoints
month Q1 Q2 Q3 Q4 Q5
201001 322770 1038300 2112300 4597580 28919700
201512 379340 1239800 2840630 7785700 46209140
I wonder if using findInterval
and data.table
might be worth pursuing and faster. This was adapted from this answer which I thought was very helpful.
findInterval
finds the index of one vector in another (assuming the other is non-decreasing). In this case, breakpoints
columns from Q1
to Q5
forms the second vector, and the function will return the index based on the firmSize
value in the first data frame.
library(data.table)
setDT(df)
setkey(df, month)
setDT(breakpoints)
setkey(breakpoints, month)
df[, quintile := findInterval(firmSize, breakpoints[.BY][, Q1:Q5]) + 1, by = month][]
Output
month firmID firmSize quintile
1: 201001 46603210 9738635 5
2: 201001 72913210 1166077 3
3: 201001 621210 3884422 4
4: 201512 75991610 2932127 4
5: 201512 45383610 1241272 3
6: 201512 5766520 1931038 3