Search code examples
rdatatablesubsetunique

Subset data table by all unique entries in columns whose name contain a certain substring, fill with NA for other entries


I have a data table and I'd like to copy or subset to all the unique observations in variables whose variable name contains a certain substring. I can conceive of a way to do it clunkily with for loops, but that seems to defeat the utility and speed of a data table.

For example:

library(data.table)
library(dplyr)
dt <- setDT(tibble(a_ID = c(1, 1, 2, 11, 2),
                   b = c('x','y','z','z', 'x'),
                   b_ID = c('XY','XY','XY','XY', 'XY'),
                   d = 1:5))

What I would like to do from here is subset by all columns whose name contains 'ID', returning only the unique entries from each column, and since that will result in columns with differing amounts of observations for each, fill with NA for the rest.

That is, I'd like to return the following:

subset_dt <- setDT(tibble(a_ID = c(1,2,11),
                          b_ID = c('XY', NA,NA)))

Is this possible with data.table functionality?


Solution

  • We may get the unique elements and then replace the duplicated with NA

    library(data.table)
    dt[, lapply(.SD, unique), .SDcols = patterns("_ID$")][,
        lapply(.SD, \(x) replace(x, duplicated(x), NA))]
    

    -output

        a_ID   b_ID
       <num> <char>
    1:     1     XY
    2:     2   <NA>
    3:    11   <NA>
    

    Or another option with unique

    unique(dt[, .(a_ID, b_ID)])[, lapply(.SD, \(x) fcase(!duplicated(x), x))]
        a_ID   b_ID
       <num> <char>
    1:     1     XY
    2:     2   <NA>
    3:    11   <NA>
    

    Or another option is to block the code, check for the lengths after the unique step and append NA to fix the length

    dt[, {lst1 <- lapply(.SD, unique)
         mx <- max(lengths(lst1))
        lapply(lst1, `length<-`, mx)}, .SDcols = patterns("_ID$")]
        a_ID   b_ID
       <num> <char>
    1:     1     XY
    2:     2   <NA>
    3:    11   <NA>
    

    We may also use collapse - select the columns (gvr), get the unique rows (funique), loop over the columns with dapply, replace the duplicates with NA

    library(collapse)
    dapply(funique(gvr(dt, "_ID$")), MARGIN = 2, 
       FUN = \(x) replace(x, duplicated(x), NA))
        a_ID   b_ID
       <num> <char>
    1:     1     XY
    2:     2   <NA>
    3:    11   <NA>