Search code examples
rdata.tablesequences

number rows by variable, but start over when condition is hit


I want to number certain combinations of row in a dataframe (which is ordered on ID and on Time)

tc <- textConnection('
id              time       end_yn
abc             10         0
abc             11         0
abc             12         1
abc             13         0
def             10         0
def             15         1
def             16         0
def             17         0
def             18         1
')

test <- read.table(tc, header=TRUE)

The goal is to create a new column ("number") that numbers each row per id from 1 to n until end_yn == 1 is hit. After end_yn == 1, the numbering should start over.

Without taking the end_yn == 1 condition into account the rows can be numbered using:

DT <- data.table(test)
DT[, id := seq_len(.N), by = id]

However the expected outcome should be:

id              time       end_yn   number
abc             10         0        1
abc             11         0        2
abc             12         1        3 
abc             13         0        1 
def             10         0        1
def             15         1        2
def             16         0        1
def             17         0        2
def             18         1        3

How to incorporate the end_yn == 1 condition?


Solution

  • I'm guessing there are different ways to do this, but here's one:

    DT[, cEnd := c(0,cumsum(end_yn)[-.N])] # carry the end value forward
    
    DT[, number := seq_len(.N), by = "id,cEnd"] # create your sequence
    
    DT[, cEnd := NULL] # remove the column created above
    

    Setting id as the key for DT might be worth while.