I have a large panel dataset (10,000,000 x 53) with about 50 columns of scores. I have aggregated each score by group (there are about 15,000) and date.
Now I want to calculate a rolling sum of three values including the prior two dates' and the current date's scores, creating a new corresponding sum column. The sums should be calculated for each score column by date and group. For 1st and 2nd dates within a group, fewer than 3 values is allowed.
GROUP DATE LAGGED SCORE1 SUM1 SCORE2 SUM2 ... SCORE50 SUM50
#1 A 2017-04-01 2017-03-30 1 1|1 2 2|2 4 4|4
#2 A 2017-04-02 2017-03-31 1 1+1|2 3 3+2|5 3 3+4|7
#3 A 2017-04-04 2017-04-02 2 2+1+1|4 4 4+3+2|9 2 2+3+4|9
#5 B 2017-04-02 2017-03-31 2 2|2 3 3|3 1 1|1
#6 B 2017-04-05 2017-04-03 2 2+2|4 2 2+3|5 1 1+1|2
#7 B 2017-04-08 2017-04-06 3 3+2+2|7 1 1+2+3|6 3 3+1+1|5
#8 C 2017-04-02 2017-03-31 3 3|3 1 1|1 1 1|1
#9 C 2017-04-03 2017-04-01 2 2+3|5 3 3+1|4 2 2+1|3
: : : : : : : : : :
#10M XX 2018-03-30 2018-03-28 2 2 1 1 ... 1 1
David's answer from this post covered most of my questions on summing rolling windows by groups but I'm still missing a couple pieces.
library(data.table) #v1.10.4
## Convert to a proper date class, and add another column
## in order to define the range
setDT(input)[, c("Date", "Date2") := {
Date = as.IDate(Date)
Date2 = Date - 2L
.(Date, Date2)
}]
## Run a non-equi join against the unique Date/Group combination in input
## Sum the Scores on the fly
## You can ignore the second Date column
input[unique(input, by = c("Date", "Group")), ## This removes the dupes
on = .(Group, Date <= Date, Date >= Date2), ## The join condition
.(Score = sum(Score)), ## sum the scores
keyby = .EACHI] ## Run the sum by each row in
## unique(input, by = c("Date", "Group"))
My question has two parts:
A possible solution:
cols <- grep('^SCORE', names(input), value = TRUE)
input[, gsub('SCORE','SUM',cols) := lapply(.SD, cumsum)
, by = GROUP
, .SDcols = cols][]
which gives:
GROUP DATE LAGGED SCORE1 SCORE2 SUM1 SUM2 1: A 2017-04-01 2017-03-30 1 2 1 2 2: A 2017-04-02 2017-03-31 1 3 2 5 3: A 2017-04-04 2017-04-02 2 4 4 9 4: B 2017-04-02 2017-03-31 2 3 2 3 5: B 2017-04-05 2017-04-03 2 2 4 5 6: B 2017-04-08 2017-04-06 3 1 7 6 7: C 2017-04-02 2017-03-31 3 1 3 1 8: C 2017-04-03 2017-04-01 2 3 5 4
When you want to take a time window into account as well, you could do (assuming LAGGED
is the start of the time-window):
input[input[input[, .(GROUP, DATE, LAGGED)]
, on = .(GROUP, DATE >= LAGGED, DATE <= DATE)
][, setNames(lapply(.SD, sum), gsub('SCORE','SUM',cols))
, by = .(GROUP, DATE = DATE.1)
, .SDcols = cols]
, on = .(GROUP, DATE)]
which gives:
GROUP DATE LAGGED SCORE1 SCORE2 SUM1 SUM2 1: A 2017-04-01 2017-03-30 1 2 1 2 2: A 2017-04-02 2017-03-31 1 3 2 5 3: A 2017-04-04 2017-04-02 2 4 3 7 4: B 2017-04-02 2017-03-31 2 3 2 3 5: B 2017-04-05 2017-04-03 2 2 2 2 6: B 2017-04-08 2017-04-06 3 1 3 1 7: C 2017-04-02 2017-03-31 3 1 3 1 8: C 2017-04-03 2017-04-01 2 3 5 4
Used data:
input <- fread(' GROUP DATE LAGGED SCORE1 SCORE2
A 2017-04-01 2017-03-30 1 2
A 2017-04-02 2017-03-31 1 3
A 2017-04-04 2017-04-02 2 4
B 2017-04-02 2017-03-31 2 3
B 2017-04-05 2017-04-03 2 2
B 2017-04-08 2017-04-06 3 1
C 2017-04-02 2017-03-31 3 1
C 2017-04-03 2017-04-01 2 3')