Search code examples
rdata.tablestrsplit

Filtering data.table rows by the presence of a column in a strsplit of another column


I have a data table:

dt <- data.table(col1=c('aa,bb', 'bb,cc,ee', 'dd,ee'), col2=c('aa', 'cc', 'aa'))
> dt
    col1      col2
1: aa,bb      aa
2: bb,cc,ee   cc
3: dd,ee      aa

I want to check if column 2 occurs in the strsplit of column one, so for the first row if aa is present in aa,bb split by a comma, which is true. It's also true for the second row, and false for the third. I only want to keep the rows where this occurs, so only row 1 and 2.

My first thought was doing it like this:

dt[col2 %in% strsplit(col1, ',')]

However, that returns an empty data.table.

I can think of multiple solutions to solve this, including making new columns using tstrsplit, or melting the data table, but all of these are a bit tedious for such a seemingly simple task. Any suggestions?


Solution

  • We can use str_detect from stringr

    library(stringr)
    dt[, flag := str_detect(col1, col2)]
    dt
    #       col1 col2  flag
    #1:    aa,bb   aa  TRUE
    #2: bb,cc,ee   cc  TRUE
    #3:    dd,ee   aa FALSE
    

    Also, to avoid any substring matches, we can specify the word boundary (\\b)

    dt[, str_detect(col1, str_c("\\b", col2, "\\b"))]
    #[1]  TRUE  TRUE FALSE
    

    Regarding the use of strsplit, the output would be a list of vectors. So, we need to use a function that checks the values of 'col1' are in the corresponding elements of list. Map does that

    dt[,  unlist(Map(`%in%`, col2, strsplit(col1, ",")))]
    

    To apply the filter in the same step and return the 2 row data.table:

    dt[unlist(Map(`%in%`, col2, strsplit(col1, ",")))]