
R data.table: find the lag between the current row and the previous matching row


> tempDT <- data.table(colA = c("E","E","A","A","E","A","E")
+                      , lags = c(NA,1,1,2,3,1,2))
> tempDT
   colA lags
1:    E   NA
2:    E    1
3:    A    1
4:    A    2
5:    E    3
6:    A    1
7:    E    2

I have a column colA and need to find, for each row, the lag (the number of rows) between the current row and the most recent previous row whose colA == "E". The lags column above shows the desired result.

Note: if we could find the row index of the previous row whose colA == "E", we could calculate the lags from it. However, I don't know how to get that index.


Solution

    1) Define lastEpos which, given i, returns the position of the last E among the first i rows. Apply it to each row number and shift the result down by one row, so that each row is compared with the last E strictly before it:

    lastEpos <- function(i) tail(which(tempDT$colA[1:i] == "E"), 1)
    tempDT[, lags := .I - shift(sapply(.I, lastEpos))]
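
To make the intermediate values in (1) visible, here is a small sketch that computes each step on the example data outside of `tempDT[...]`. The names `rows`, `pos`, `prev` and `lagsByHand` are illustrative only and do not appear in the answer above.

```r
library(data.table)

tempDT <- data.table(colA = c("E", "E", "A", "A", "E", "A", "E"))
lastEpos <- function(i) tail(which(tempDT$colA[1:i] == "E"), 1)

rows <- seq_len(nrow(tempDT))    # the values .I takes inside tempDT[...]
pos  <- sapply(rows, lastEpos)   # last "E" at or before each row
prev <- shift(pos)               # last "E" at or before the previous row
lagsByHand <- rows - prev        # distance back to that "E"
```

Note that this relies on row 1 being "E", as it is in the example. If it were not, `which()` would return an empty vector for the leading rows and `sapply` would return a list rather than an integer vector.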
    

    Here are a few variations:

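    2) i-1. In this variation lastEpos returns the position of the last E among the first i-1 rows rather than the first i, so shift is no longer needed. Prepending NA makes it return NA when there is no earlier E: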

    lastEpos <- function(i) tail(c(NA, which(tempDT$colA[seq_len(i-1)] == "E")), 1)
    tempDT[, lags := .I - sapply(.I, lastEpos)]
    

    3) Position. Similar to (2), but uses Position with right = TRUE to find the index of the last TRUE; it returns NA when there is none:

    lastEpos <- function(i) Position(c, tempDT$colA[seq_len(i-1)] == "E", right = TRUE)
    tempDT[, lags := .I - sapply(.I, lastEpos)]
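
The `Position(c, ...)` idiom may look odd at first, so here is a small base R sketch of what it does on a plain logical vector. The vector `isE` and the result names are made up for illustration. Because `c` applied to a single TRUE/FALSE value returns that value unchanged, it acts as the predicate here; `identity` would presumably serve equally well.

```r
isE <- c(TRUE, TRUE, FALSE, FALSE, TRUE)

# Scan from the right and return the index of the first TRUE encountered,
# i.e. the position of the last TRUE in the vector.
lastTrue <- Position(c, isE, right = TRUE)

# Restricting to the first 3 elements mimics seq_len(i - 1) with i = 4.
lastTrueBefore4 <- Position(c, isE[seq_len(3)], right = TRUE)

# With nothing to scan (the i = 1 case) the default nomatch value NA is returned.
noMatch <- Position(c, logical(0), right = TRUE)
```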
    

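    4) rollapply. Here w holds, for each row i, the offsets of the i-1 preceding rows, and Position is applied over that window: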

    library(zoo)
    w <- lapply(1:nrow(tempDT), function(i) -rev(seq_len(i-1)))
    tempDT[, lags := .I - rollapply(colA == "E", w, Position, f = c, right = TRUE)]
    

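    5) sqldf. Left join each row to all earlier E rows and use max to pick the nearest one, so the result does not depend on which joined row SQLite happens to return for the group:

    library(sqldf)
    
    sqldf("select a.colA, a.rowid - max(b.rowid) lags
           from tempDT a left join tempDT b
           on b.rowid < a.rowid and b.colA = 'E'
           group by a.rowid")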
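
For larger tables, a loop-free variation may also be worth trying. The following is a sketch based on `cummax`, offered as an assumption rather than as one of the approaches listed above: it records the row number at every E, carries the latest one forward, and then shifts by one row as in (1). The names `n`, `ePos` and `prevE` are illustrative, and the sketch assumes colA contains no NA values.

```r
library(data.table)

tempDT <- data.table(colA = c("E", "E", "A", "A", "E", "A", "E"))
n <- nrow(tempDT)

# Row number where colA is "E", 0 elsewhere; cummax carries the latest
# "E" position forward over the rows that follow it.
ePos <- cummax(ifelse(tempDT$colA == "E", seq_len(n), 0L))

# Shift so each row sees the last "E" strictly before it; a remaining 0
# means no earlier "E" exists, which is reported as NA.
prevE <- shift(ePos)
prevE[which(prevE == 0L)] <- NA_integer_

tempDT[, lags := .I - prevE]
```

Unlike (1), this does not require the first row to be "E", since rows with no earlier E simply end up as NA.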