Search code examples
rdata.tablegroupingmultiple-columnsdcast

data.table: How to change column values based on grouped unique row values that contain column names


I have a data.table with ~18^6 rows and I need to take the unique values of CLASS, by ID, and set their respective columns to 1, as seen below in a baby example

DT <- data.table::data.table(ID=c("1","1","1","2","2"),
                             CLASS=c("a","a","b","c","b"),
                             a=c(0,0,0,0,0),
                             b=c(0,0,0,0,0),
                             c=c(0,0,0,0,0))

### Start with this
ID CLASS a b c
1      a 0 0 0
1      a 0 0 0
1      b 0 0 0
2      c 0 0 0
2      b 0 0 0

### Want this
ID CLASS a b c
1      a 1 1 0
1      a 1 1 0
1      b 1 1 0
2      c 0 1 1
2      b 0 1 1

My first instinct was to try the code below but found that it will set all columns to 1 because unique(DT$CLASS) inherently includes all unique values for all IDs and is not passed through the "grouping" argument parameter so to say.

### Tried this
DT[,unique(DT$CLASS):=1,by=ID]

### Got this
ID CLASS a b c
1      a 1 1 1
1      a 1 1 1
1      b 1 1 1
2      c 1 1 1
2      b 1 1 1

I've been struggling with utilizing data.table to its full potential and speed and would like to create the desired output only using commands within data.table arguments.

Can someone help me write the proper code, using only data.table commands/arguments,so that my jth index includes only the unique values, by ID, and sets the appropriate columns to 1?

Follow-up Question:

Say that each row also has an associated date to it, RXDATE, and I'd want to create respective column names of all class values that holds the minimum RXDATE by CLASS by ID. Could I also turn to dcast for this?

### Start with this
ID CLASS a b c RXDATE
1      a 1 1 0 1-1-99
1      a 1 1 0 1-2-99
1      b 1 1 0 1-3-99
2      c 0 1 1 5-4-00
2      b 0 1 1 6-5-01

### Want this
ID CLASS a b c RXDATE   a_DT   b_DT   c_DT
1      a 1 1 0 1-1-99 1-1-99 1-3-99     NA
1      a 1 1 0 1-2-99 1-1-99 1-3-99     NA
1      b 1 1 0 1-3-99 1-1-99 1-3-99     NA
2      c 0 1 1 5-4-00     NA 6-5-01 5-4-00
2      b 0 1 1 6-5-01     NA 6-5-01 5-4-00

Solution

  • Using dcast and merge you could also do:

    DT <- data.table::data.table(ID=c("1","1","1","2","2"),
                                 CLASS=c("a","a","b","c","b"),
                                 a=c(0,0,0,0,0),
                                 b=c(0,0,0,0,0),
                                 c=c(0,0,0,0,0))
    
    # dcast to convert to wide
    DT_dcast <- dcast(DT[, .(ID, CLASS)], ID ~ CLASS, fun.aggregate = function(x) length(unique(x)), value.var = "CLASS")
    DT_dcast
       ID a b c
    1:  1 1 1 0
    2:  2 0 1 1
    
    # Then merge with the original data.table
    DT_m <- merge(DT[, .(ID, CLASS)], DT_dcast, by = "ID")
    DT_m
       ID CLASS a b c
    1:  1     a 1 1 0
    2:  1     a 1 1 0
    3:  1     b 1 1 0
    4:  2     c 0 1 1
    5:  2     b 0 1 1
    

    EDIT You could still use the same approach with dcast and merge.

    I noticed from your 'start with this' data has different RX date for row 2 and from 'want this' data you only kept '1-1-99' for this.

    DT2 <- data.table::data.table(ID=c("1","1","1","2","2"),
                                 CLASS=c("a","a","b","c","b"),
                                 a=c(0,0,0,0,0),
                                 b=c(0,0,0,0,0),
                                 c=c(0,0,0,0,0), 
                                 RXDate = c("1-1-99", "1-2-99", "1-3-99", "5-4-00", "6-5-01"))
    
    # 2nd row from the data provided has different RXDate under same ID and Class.
    # Use x[1] to pick first
    DT_dcast <- dcast(DT2[, .(ID, CLASS, RXDate)], ID ~ CLASS, 
                      fun.aggregate = function(x) x[1], 
                      value.var = c("CLASS", "RXDate"))
    DT_dcast
       ID CLASS.1_a CLASS.1_b CLASS.1_c RXDate_a RXDate_b RXDate_c
    1:  1         a         b      <NA>   1-1-99   1-3-99     <NA>
    2:  2      <NA>         b         c     <NA>   6-5-01   5-4-00
    
    # Convert 1 or 0 under CLASS
    class_cols <- names(DT_dcast)[grepl("CLASS", names(DT_dcast))]
    for (col in class_cols) set(DT_dcast, j = col, value = ifelse(is.na(DT_dcast[[col]]), 0, 1))
    
    DT_dcast
    ID CLASS.1_a CLASS.1_b CLASS.1_c RXDate_a RXDate_b RXDate_c
    1:  1         1         1         0   1-1-99   1-3-99     <NA>
    2:  2         0         1         1     <NA>   6-5-01   5-4-00
    
    # Then merge with the original data.table
    DT_m <- merge(DT2[, .(ID, CLASS, RXDate)], DT_dcast, by = "ID")
    DT_m
    
       ID CLASS RXDate CLASS.1_a CLASS.1_b CLASS.1_c RXDate_a RXDate_b RXDate_c
    1:  1     a 1-1-99         1         1         0   1-1-99   1-3-99     <NA>
    2:  1     a 1-2-99         1         1         0   1-1-99   1-3-99     <NA>
    3:  1     b 1-3-99         1         1         0   1-1-99   1-3-99     <NA>
    4:  2     c 5-4-00         0         1         1     <NA>   6-5-01   5-4-00
    5:  2     b 6-5-01         0         1         1     <NA>   6-5-01   5-4-00
    

    If you want to rename the columns then you can do it using setnames