Search code examples
rdata.tablelagshift

data.table: How to indicate first occurrence of unique column value by group


I have a large data.table ~ 18*10^6 rows filled with columns ID and CLASS and I want to create a new binary column that indicates the occurrence of a new CLASS value by ID.

DT <- data.table::data.table(ID=c("1","1","1","2","2"),
                             CLASS=c("a","a","b","c","b"))
### Starting
ID CLASS
 1     a
 1     a
 1     b
 2     c
 2     b

### Desired
ID CLASS NEWCLS
 1     a      1
 1     a      0
 1     b      1
 2     c      1
 2     b      1

I originally initialized the NEWCLS variable and used the data.table::shift() function to lag a 1 by ID and CLASS

DT[,NEWCLS:=0]

DT[,NEWCLS:=data.table::shift(NEWCLS, n = 1L, fill = 1, type = "lag"),by=.(ID,CLASS)]

This creates the desired output but with ~18*10^6 rows it takes quite some time, even for data.table.

Would someone know how to create the NEWCLS variable in quicker and more efficient way using solely data.table arguments?


Solution

  • One possibility could be:

    DT[, NEWCLS := as.integer(!duplicated(CLASS)), by = ID]
    
       ID CLASS NEWCLS
    1:  1     a      1
    2:  1     a      0
    3:  1     b      1
    4:  2     c      1
    5:  2     b      1