I have a large data.table ~ 18*10^6 rows filled with columns ID and CLASS and I want to create a new binary column that indicates the occurrence of a new CLASS value by ID.
DT <- data.table::data.table(ID=c("1","1","1","2","2"),
CLASS=c("a","a","b","c","b"))
### Starting
ID CLASS
1 a
1 a
1 b
2 c
2 b
### Desired
ID CLASS NEWCLS
1 a 1
1 a 0
1 b 1
2 c 1
2 b 1
I originally initialized the NEWCLS variable and used the data.table::shift() function to lag a 1 by ID and CLASS
DT[,NEWCLS:=0]
DT[,NEWCLS:=data.table::shift(NEWCLS, n = 1L, fill = 1, type = "lag"),by=.(ID,CLASS)]
This creates the desired output but with ~18*10^6 rows it takes quite some time, even for data.table.
Would someone know how to create the NEWCLS variable in quicker and more efficient way using solely data.table arguments?
One possibility could be:
DT[, NEWCLS := as.integer(!duplicated(CLASS)), by = ID]
ID CLASS NEWCLS
1: 1 a 1
2: 1 a 0
3: 1 b 1
4: 2 c 1
5: 2 b 1