I've got a large data.table (~50M rows) with multiple ID columns (ID1, ID2) that groups rows. I'd like to shift the Value column based on an inequality rolling condition as i will outline below. what is rolling condition? I just made that term up. It means the condition changes (rolls) too.
> require(data.table)
> DT = data.table(ID1 = c(rep(1,7), rep(2,4)), ID2 = c('A','B','B','C','C','A','A','D','D','E','D'), Value = (1:11))
Output: the Value of the last row with the same ID1 (so group by = ID1) if ID2 of that last row is NOT the same as the ID2 of the row.
> DT
ID1 ID2 Value desired_output
1: 1 A 1 NA -- no previous row with the same ID1 and different ID2
2: 1 B 2 1 -- last row with same ID1 and different ID2 is row 1, so desired_output is the Value of row 1
3: 1 B 3 1 -- last row with same ID1 and different ID2 is row 1, so desired_output is the Value of row 1
4: 1 C 4 3 -- last row with same ID1 and different ID2 is row 3, so desired_output is the Value of row 3
5: 1 C 5 3 -- last row with same ID1 and different ID2 is row 3, so desired_output is the Value of row 3
6: 1 A 6 5 -- last row with same ID1 and different ID2 is row 5, so desired_output is the Value of row 5
7: 1 A 7 5 -- last row with same ID1 and different ID2 is row 5, so desired_output is the Value of row 5
8: 2 D 8 NA -- no previous row with the same ID1 and different ID2
9: 2 D 9 NA -- no previous row with the same ID1 and different ID2
10: 2 E 10 9 -- last row with same ID1 and different ID2 is row 9, so desired_output is the Value of row 9
11: 2 D 11 10 -- last row with same ID1 and different ID2 is row 10, so desired_output is the Value of row 10
Looking for a computationally efficient way of doing this.
Here's one way to do it:
DT[, rleid_id2 := rleid(ID2), by = .(ID1)]
DT[DT, on = .(ID1, rleid_id2 > rleid_id2), val := i.Value]
> DT
ID1 ID2 Value rleid_id2 val
1: 1 A 1 1 NA
2: 1 B 2 2 1
3: 1 B 3 2 1
4: 1 C 4 3 3
5: 1 C 5 3 3
6: 1 A 6 4 5
7: 1 A 7 4 5
8: 2 D 8 1 NA
9: 2 D 9 1 NA
10: 2 E 10 2 9
11: 2 D 11 3 10
Using rleid
we can turn the sequence of ID2
into a numbered id which increments on every change, grouped by ID1. Then we join DT on itself using non-equi condition on rleid_id2
(plus ID1) so that we also extract the value that has the same ID1 but a lower rleid_id2
value.