For each row, I want to count the rows that precede it within the previous one-year window, grouped by id.
Here's my data:
df <- structure(list(id = c("1", "1", "1", "1",
"2", "2", "2", "2", "2", "2", "2",
"2", "2"), flag = c(1, 1, 0, 1, 0, 0, 1, 1,
1, 1, 1, 1, 1), date = structure(c(15425, 15456, 16613,
16959, 15513, 15513, 15625, 15635, 15649, 15663, 15670, 16051,
16052), class = "Date")), sorted = "id", class = c("data.table",
"data.frame"), row.names = c(NA, -13L))
roll_sum <- c(0, 1, 0, 1, 0, 1, 2, 3, 4, 5, 6, 0, 1)
flag_sum <- c(0, 1, 0, 0, 0, 0, 0, 1, 2, 3, 4, 0, 1)
df_desired <- cbind(df, roll_sum) # roll_sum: number of rows excluding current row in 1 year time frame rolling
df_desired <- cbind(df_desired, flag_sum) # flag_sum: number of rows excluding current row in 1 year time frame rolling where flag was 1
Data:
id flag date
1: 1 1 2012-03-26
2: 1 1 2012-04-26
3: 1 0 2015-06-27
4: 1 1 2016-06-07
5: 2 0 2012-06-22
6: 2 0 2012-06-22
7: 2 1 2012-10-12
8: 2 1 2012-10-22
9: 2 1 2012-11-05
10: 2 1 2012-11-19
11: 2 1 2012-11-26
12: 2 1 2013-12-12
13: 2 1 2013-12-13
Output:
df_desired
id flag date roll_sum flag_sum
1: 1 1 2012-03-26 0 0
2: 1 1 2012-04-26 1 1
3: 1 0 2015-06-27 0 0
4: 1 1 2016-06-07 1 0
5: 2 0 2012-06-22 0 0
6: 2 0 2012-06-22 1 0
7: 2 1 2012-10-12 2 0
8: 2 1 2012-10-22 3 1
9: 2 1 2012-11-05 4 2
10: 2 1 2012-11-19 5 3
11: 2 1 2012-11-26 6 4
12: 2 1 2013-12-12 0 0
13: 2 1 2013-12-13 1 1
I tried the zoo-based solution given by G. Grothendieck in "Compute rolling sum by id variables, with missing timepoints", but it gives me an error:
Error in merge.zoo(z, g) : series cannot be merged with non-unique index entries in a series
In addition: Warning message:
In zoo(count, date) :
I made the date column unique using make.index.unique and make.time.unique.
Any help with an optimized solution would be appreciated. Thanks.
I'm not sure how well this will scale to the size of your data.
First, create a running row index (rn) to handle duplicate dates, so that each row counts only the rows that come before it even when they share its date, and also create the date one year earlier (I would argue that 365 is better, but it seems the OP wants 366).
Then perform a non-equi self-join that keeps only earlier rows (by rn) of the same id whose dates fall within that one-year window.
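# rn: running row index, so a row never counts itself or later rows
# oneYrAgo: start of the look-back window
df[, c("rn", "oneYrAgo") := .(.I, date - 366)]
# Non-equi self-join: for each row, match earlier rows (rn < rn) of the
# same id whose date lies between oneYrAgo and the row's own date
df[df,
.(roll_sum=.N, flag_sum=sum(flag, na.rm=TRUE)),
on=.(date >= oneYrAgo, rn < rn, id, date <= date),
by=.EACHI][,
-seq_len(2L)] # drop the first two join columns (date, rn)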
result:
id date roll_sum flag_sum
1: 1 2012-03-26 0 0
2: 1 2012-04-26 1 1
3: 1 2015-06-27 0 0
4: 1 2016-06-07 1 0
5: 2 2012-06-22 0 0
6: 2 2012-06-22 1 0
7: 2 2012-10-12 2 0
8: 2 2012-10-22 3 1
9: 2 2012-11-05 4 2
10: 2 2012-11-19 5 3
11: 2 2012-11-26 6 4
12: 2 2013-12-12 0 0
13: 2 2013-12-13 1 1
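As a cross-check of the window definition, here is a minimal brute-force sketch in base R. It is not the join above and will be much slower on large data; it only restates the rule "earlier rows of the same id within 366 days" directly. The names d and prev_in_window are made up for this illustration, and the data values are copied from the question.

```r
# Rebuild the example data as a plain data.frame (values copied from the question)
d <- data.frame(
  id   = c(rep("1", 4), rep("2", 9)),
  flag = c(1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1),
  date = as.Date(c(15425, 15456, 16613, 16959, 15513, 15513, 15625, 15635,
                   15649, 15663, 15670, 16051, 16052), origin = "1970-01-01")
)

# Hypothetical helper: positions of earlier rows (by row order) with the same id
# whose date lies in [date_i - 366, date_i]
prev_in_window <- function(i) {
  j <- seq_len(i - 1L)
  j[d$id[j] == d$id[i] &
      d$date[j] >= d$date[i] - 366 &
      d$date[j] <= d$date[i]]
}

# Count the matching rows, and sum their flags, for every row
d$roll_sum <- vapply(seq_len(nrow(d)),
                     function(i) length(prev_in_window(i)), integer(1))
d$flag_sum <- vapply(seq_len(nrow(d)),
                     function(i) sum(d$flag[prev_in_window(i)]), numeric(1))
```

On the sample data this should give the same roll_sum and flag_sum columns as the desired output, which makes it handy for validating a faster solution on a small subset.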