I have a data.table with 1.6x10^8 records and I want to create a new character column based on the indicator column names for where there is a value of 1.
For example,
library(data.table)
DT <- data.table::data.table(ID=c("a","a","a","b","b"),
drugA=c(1,1,1,0,0),
drugB=c(0,1,1,1,0),
drugC=c(0,0,1,0,1))
ID drugA drugB drugC
1: a 1 0 0
2: a 1 1 0
3: a 1 1 1
4: b 0 1 0
5: b 0 0 1
### NOTE: I know the paste0(...,collapse) argument might be helpful in concatenating the drug names as an intermediate step
ID drugA drugB drugC exposure
1: a 1 0 0 drugA
2: a 1 1 0 drugA+drugB
3: a 1 1 1 drugA+drugB+drugC
4: b 0 1 0 drugB
5: b 0 0 1 drugC
I want this to be as robust as clean as possible and solely rely on data.table syntax and/or other useful packages/functions (e.g. dcast) ; I want to steer away from creating an extensive user-defined function because given my data.table size, it would take extremely long to run.
I've looked at other posts but I was unable to find something similar to my situation and desired output.
Any help would be greatly appreciated.
We can do a group by sequence of rows, specify the .SDcols
as the 'drug' column, convert the Subset of Data.table (.SD
) to logical
, use that to subset the column names and paste
them together
library(data.table)
DT[, exposure := paste(names(.SD)[as.logical(.SD)], collapse= '+'),
1:nrow(DT), .SDcols = drugA:drugC]
DT
# ID drugA drugB drugC exposure
#1: a 1 0 0 drugA
#2: a 1 1 0 drugA+drugB
#3: a 1 1 1 drugA+drugB+drugC
#4: b 0 1 0 drugB
#5: b 0 0 1 drugC
Or instead of grouping by rows, we can loop over the columns, change the values to column names and then paste
with do.call
and remove the NA
elements with gsub
DT[, exposure := gsub("NA\\+|\\+NA", "", do.call(paste,
c(Map(function(x, y) names(.SD)[(NA^!x) * y], .SD,
seq_along(.SD)), sep="+"))), .SDcols = drugA:drugC]