Search code examples
rdatatable

subset rows in data table with many columns


I'm trying my best to not compute on rows using data.table since it's not advisable. I want to subset all rows that have a value greater or equal to 2 of a large data.table that has many columns, and potentially thereafter also get the columns.

Reproducible example:

 DT <- as.data.table( rbind( diag(5)+1, diag(5), rep(0,5)) )
 DT[1,1] = 1

My code:

DT[,lapply(.SD, function(x) x>=2 )]

I get a data.table of booleans, confused now how I should subset this.

One approach that works mentioned by @IRTFM is:

DT[DT[,apply(.SD, 1, function(x) any(x>=2))]]

    V1 V2 V3 V4 V5
1:  1  2  1  1  1
2:  1  1  2  1  1
3:  1  1  1  2  1
4:  1  1  1  1  2

Now how I should also subset the columns that have a 2 value ??

Desirable result is:

      V2 V3 V4 V5
1:    2  1  1  1
2:    1  2  1  1
3:    1  1  2  1
4:    1  1  1  2

Solution

  • I could not get your example to load in my console session but this is a much more "minima;" example that demonstrates a method. Not sure if it has the usual data.table efficiency though:

    DT <- setDT( data.frame(x=1:2, y=0,z=0))
    DT[, apply(.SD, 1, function(x){any(x>=2)}) ] gets you a logical vector for each row
    # [1] FALSE  TRUE
    DT[ DT[, apply(.SD, 1, function(x){any(x>=2)}) ]] # uses that vector to select rows
       x y z
    1: 2 0 0
    

    This should succeed as well:

    DT[ as.logical(rowSums(DT >= 2))]
       x y z
    1: 2 0 0
    

    For the second part consider this:

    cols <- sapply(DT, function(x){ any(x>0)})
    DT2[ ,.SD, .SDcols=names(cols[cols])]