Search code examples
rdata.tablesubset

Subset data to contain only columns whose names match multiple condition using data.table


This is based upon this question. The question said:

Is there a way for me to subset data based on column names starting with a particular string? I have some columns which are like ABC_1 ABC_2 ABC_3 and some like XYZ_1, XYZ_2,XYZ_3 let's say.

How can I subset my df based only on columns containing the above portions of text (lets say, ABC or XYZ)? I can use indices, but the columns are too scattered in data and it becomes too much of hard coding.

Also, I want to only include rows from each of these columns where any of their value is >0 so if either of the 6 columns above has a 1 in the row, it makes a cut into my final data frame.

One of the answers used dplyr to select multiple columns for the new data:

df <- df %>% dplyr:: select(grep("ABC", names(df)), grep("XYZ", names(df)))

However, I'd like a data.table solution similar to the above. I know that (as per one of the answers) that data.table can do one condition, but unsure how to do multiple:

df[, grep("ABC", names(df)), with = FALSE]


Solution

  • You can select multiple columns that match certain patterns in data.table using patterns in its .SDcols argument:

    # turn df into data.table
    setDT(df)
    
    # select columns that contain ABD or XYZ
    df[, .SD, .SDcols=patterns("ABC|XYZ")]
    
    # or
    df[, grep("ABC|XYZ", names(df)), with=FALSE]
    

    subsetting rows and columns at the same time

    cols = grep("ABC|XYZ",  names(df))
    
    df[rowSums(df[, ..cols]>0)>0, .SD, .SDcols=cols]