Search code examples
rdata.tablesequences

Create unique identifier for different row-groups


I want to number certain combinations of row in a dataframe (which is ordered on ID and on Time)

tc <- textConnection('
id              time       end_yn   number
abc             10         0        1
abc             11         0        2
abc             12         1        3 
abc             13         0        1 
def             10         0        1
def             15         1        2
def             16         0        1
def             17         0        2
def             18         1        3
')

test <- read.table(tc, header=TRUE)

The goal is to create a new column ("journey_nr") that give a unique number to each row based on the journey it belongs to. Journeys are defined as a sequence of rows per id up until to end_yn == 1, also if end_ynnever becomes 1, the journey should also be numbered (see the expected outcome example). It is only possible to have end_yn == 0 journeys at the end of a collection of rows for an ID (as shown at row 4 for id 3). So either no end_yn == 1 has occured for that ID or that happened before the end_yn == 0-journey (see id == abc in the example).

I know how to number using the data.table package, but I do not know which columns to combine in order to get the expected outcome. I've searched the data.table-tag on SO, but could not find a similar problem.

Expected outcome:

id              time       end_yn   number    journey_nr
abc             10         0        1         1
abc             11         0        2         1
abc             12         1        3         1
abc             13         0        1         2
def             10         0        1         3
def             15         1        2         3
def             16         0        1         4
def             17         0        2         4
def             18         1        3         4

Solution

  • Another base R answer:

    test$journey <- cumsum(c(1,head(test$number,-1)) >= test$number)
    

    Result:

    > test
       id time end_yn number journey
    1 abc   10      0      1       1
    2 abc   11      0      2       1
    3 abc   12      1      3       1
    4 abc   13      0      1       2
    5 def   10      0      1       3
    6 def   15      1      2       3
    7 def   16      0      1       4
    8 def   17      0      2       4
    9 def   18      1      3       4