I have a data.table with ~18^6 rows and I need to take the unique values of CLASS, by ID, and set their respective columns to 1, as seen below in a baby example
DT <- data.table::data.table(ID=c("1","1","1","2","2"),
CLASS=c("a","a","b","c","b"),
a=c(0,0,0,0,0),
b=c(0,0,0,0,0),
c=c(0,0,0,0,0))
### Start with this
ID CLASS a b c
1 a 0 0 0
1 a 0 0 0
1 b 0 0 0
2 c 0 0 0
2 b 0 0 0
### Want this
ID CLASS a b c
1 a 1 1 0
1 a 1 1 0
1 b 1 1 0
2 c 0 1 1
2 b 0 1 1
My first instinct was to try the code below but found that it will set all columns to 1 because unique(DT$CLASS) inherently includes all unique values for all IDs and is not passed through the "grouping" argument parameter so to say.
### Tried this
DT[,unique(DT$CLASS):=1,by=ID]
### Got this
ID CLASS a b c
1 a 1 1 1
1 a 1 1 1
1 b 1 1 1
2 c 1 1 1
2 b 1 1 1
I've been struggling with utilizing data.table to its full potential and speed and would like to create the desired output only using commands within data.table arguments.
Can someone help me write the proper code, using only data.table commands/arguments,so that my jth index includes only the unique values, by ID, and sets the appropriate columns to 1?
Follow-up Question:
Say that each row also has an associated date to it, RXDATE, and I'd want to create respective column names of all class values that holds the minimum RXDATE by CLASS by ID. Could I also turn to dcast for this?
### Start with this
ID CLASS a b c RXDATE
1 a 1 1 0 1-1-99
1 a 1 1 0 1-2-99
1 b 1 1 0 1-3-99
2 c 0 1 1 5-4-00
2 b 0 1 1 6-5-01
### Want this
ID CLASS a b c RXDATE a_DT b_DT c_DT
1 a 1 1 0 1-1-99 1-1-99 1-3-99 NA
1 a 1 1 0 1-2-99 1-1-99 1-3-99 NA
1 b 1 1 0 1-3-99 1-1-99 1-3-99 NA
2 c 0 1 1 5-4-00 NA 6-5-01 5-4-00
2 b 0 1 1 6-5-01 NA 6-5-01 5-4-00
Using dcast
and merge
you could also do:
DT <- data.table::data.table(ID=c("1","1","1","2","2"),
CLASS=c("a","a","b","c","b"),
a=c(0,0,0,0,0),
b=c(0,0,0,0,0),
c=c(0,0,0,0,0))
# dcast to convert to wide
DT_dcast <- dcast(DT[, .(ID, CLASS)], ID ~ CLASS, fun.aggregate = function(x) length(unique(x)), value.var = "CLASS")
DT_dcast
ID a b c
1: 1 1 1 0
2: 2 0 1 1
# Then merge with the original data.table
DT_m <- merge(DT[, .(ID, CLASS)], DT_dcast, by = "ID")
DT_m
ID CLASS a b c
1: 1 a 1 1 0
2: 1 a 1 1 0
3: 1 b 1 1 0
4: 2 c 0 1 1
5: 2 b 0 1 1
EDIT
You could still use the same approach with dcast
and merge
.
I noticed from your 'start with this' data has different RX date for row 2 and from 'want this' data you only kept '1-1-99' for this.
DT2 <- data.table::data.table(ID=c("1","1","1","2","2"),
CLASS=c("a","a","b","c","b"),
a=c(0,0,0,0,0),
b=c(0,0,0,0,0),
c=c(0,0,0,0,0),
RXDate = c("1-1-99", "1-2-99", "1-3-99", "5-4-00", "6-5-01"))
# 2nd row from the data provided has different RXDate under same ID and Class.
# Use x[1] to pick first
DT_dcast <- dcast(DT2[, .(ID, CLASS, RXDate)], ID ~ CLASS,
fun.aggregate = function(x) x[1],
value.var = c("CLASS", "RXDate"))
DT_dcast
ID CLASS.1_a CLASS.1_b CLASS.1_c RXDate_a RXDate_b RXDate_c
1: 1 a b <NA> 1-1-99 1-3-99 <NA>
2: 2 <NA> b c <NA> 6-5-01 5-4-00
# Convert 1 or 0 under CLASS
class_cols <- names(DT_dcast)[grepl("CLASS", names(DT_dcast))]
for (col in class_cols) set(DT_dcast, j = col, value = ifelse(is.na(DT_dcast[[col]]), 0, 1))
DT_dcast
ID CLASS.1_a CLASS.1_b CLASS.1_c RXDate_a RXDate_b RXDate_c
1: 1 1 1 0 1-1-99 1-3-99 <NA>
2: 2 0 1 1 <NA> 6-5-01 5-4-00
# Then merge with the original data.table
DT_m <- merge(DT2[, .(ID, CLASS, RXDate)], DT_dcast, by = "ID")
DT_m
ID CLASS RXDate CLASS.1_a CLASS.1_b CLASS.1_c RXDate_a RXDate_b RXDate_c
1: 1 a 1-1-99 1 1 0 1-1-99 1-3-99 <NA>
2: 1 a 1-2-99 1 1 0 1-1-99 1-3-99 <NA>
3: 1 b 1-3-99 1 1 0 1-1-99 1-3-99 <NA>
4: 2 c 5-4-00 0 1 1 <NA> 6-5-01 5-4-00
5: 2 b 6-5-01 0 1 1 <NA> 6-5-01 5-4-00
If you want to rename the columns then you can do it using setnames