Search code examples
rdata.tablecalculated-columnsdcast

data.table: Create new character column based on indicator columns values and names


I have a data.table with 1.6x10^8 records and I want to create a new character column based on the indicator column names for where there is a value of 1.

For example,

library(data.table)
DT <- data.table::data.table(ID=c("a","a","a","b","b"),
                             drugA=c(1,1,1,0,0),
                             drugB=c(0,1,1,1,0),
                             drugC=c(0,0,1,0,1))

   ID drugA drugB drugC
1:  a     1     0     0
2:  a     1     1     0
3:  a     1     1     1
4:  b     0     1     0
5:  b     0     0     1

### NOTE: I know the paste0(...,collapse) argument might be helpful in concatenating the drug names as an intermediate step
   ID drugA drugB drugC          exposure
1:  a     1     0     0             drugA
2:  a     1     1     0       drugA+drugB
3:  a     1     1     1 drugA+drugB+drugC
4:  b     0     1     0             drugB
5:  b     0     0     1             drugC

I want this to be as robust as clean as possible and solely rely on data.table syntax and/or other useful packages/functions (e.g. dcast) ; I want to steer away from creating an extensive user-defined function because given my data.table size, it would take extremely long to run.

I've looked at other posts but I was unable to find something similar to my situation and desired output.

Any help would be greatly appreciated.


Solution

  • We can do a group by sequence of rows, specify the .SDcols as the 'drug' column, convert the Subset of Data.table (.SD) to logical, use that to subset the column names and paste them together

    library(data.table)
    DT[,  exposure := paste(names(.SD)[as.logical(.SD)], collapse= '+'), 
           1:nrow(DT), .SDcols = drugA:drugC]
    DT
    #   ID drugA drugB drugC          exposure
    #1:  a     1     0     0             drugA
    #2:  a     1     1     0       drugA+drugB
    #3:  a     1     1     1 drugA+drugB+drugC
    #4:  b     0     1     0             drugB
    #5:  b     0     0     1             drugC
    

    Or instead of grouping by rows, we can loop over the columns, change the values to column names and then paste with do.call and remove the NA elements with gsub

    DT[, exposure := gsub("NA\\+|\\+NA", "", do.call(paste, 
       c(Map(function(x, y) names(.SD)[(NA^!x) * y], .SD, 
       seq_along(.SD)), sep="+"))), .SDcols = drugA:drugC]