Search code examples
rdata.tableconditional-statementsrolling-computationinequality

rolling value based on an inequality *rolling* condition of a column in data.table R


I've got a large data.table (~50M rows) with multiple ID columns (ID1, ID2) that groups rows. I'd like to shift the Value column based on an inequality rolling condition as i will outline below. what is rolling condition? I just made that term up. It means the condition changes (rolls) too.

> require(data.table)
> DT = data.table(ID1 = c(rep(1,7), rep(2,4)), ID2 = c('A','B','B','C','C','A','A','D','D','E','D'), Value = (1:11))

Output: the Value of the last row with the same ID1 (so group by = ID1) if ID2 of that last row is NOT the same as the ID2 of the row.

> DT
    ID1 ID2 Value desired_output
 1:   1   A     1             NA  -- no previous row with the same ID1 and different ID2
 2:   1   B     2              1  -- last row with same ID1 and different ID2 is row 1, so desired_output is the Value of row 1
 3:   1   B     3              1  -- last row with same ID1 and different ID2 is row 1, so desired_output is the Value of row 1
 4:   1   C     4              3  -- last row with same ID1 and different ID2 is row 3, so desired_output is the Value of row 3
 5:   1   C     5              3  -- last row with same ID1 and different ID2 is row 3, so desired_output is the Value of row 3
 6:   1   A     6              5  -- last row with same ID1 and different ID2 is row 5, so desired_output is the Value of row 5
 7:   1   A     7              5  -- last row with same ID1 and different ID2 is row 5, so desired_output is the Value of row 5
 8:   2   D     8             NA  -- no previous row with the same ID1 and different ID2
 9:   2   D     9             NA  -- no previous row with the same ID1 and different ID2
10:   2   E    10              9  -- last row with same ID1 and different ID2 is row 9, so desired_output is the Value of row 9
11:   2   D    11             10  -- last row with same ID1 and different ID2 is row 10, so desired_output is the Value of row 10

Looking for a computationally efficient way of doing this.


Solution

  • Here's one way to do it:

    DT[, rleid_id2 := rleid(ID2), by = .(ID1)]
    DT[DT, on = .(ID1, rleid_id2 > rleid_id2), val := i.Value]
    
    > DT
        ID1 ID2 Value rleid_id2 val
     1:   1   A     1         1  NA
     2:   1   B     2         2   1
     3:   1   B     3         2   1
     4:   1   C     4         3   3
     5:   1   C     5         3   3
     6:   1   A     6         4   5
     7:   1   A     7         4   5
     8:   2   D     8         1  NA
     9:   2   D     9         1  NA
    10:   2   E    10         2   9
    11:   2   D    11         3  10
    

    Using rleid we can turn the sequence of ID2 into a numbered id which increments on every change, grouped by ID1. Then we join DT on itself using non-equi condition on rleid_id2 (plus ID1) so that we also extract the value that has the same ID1 but a lower rleid_id2 value.