Search code examples
rmergedata.table

join two data.table based on pattern matching


Given:

library(data.table)
dat1 <- setDT(data.frame(pat=c("A.C",".BC"),val=c(1,2)))
dat2 <- setDT(data.frame(q=c("ABC","AXC","XBC"),val2=c(10,11,12)))

I want the result to be:

dat3 <- setDT(data.frame(pat=c("A.C","A.C",".BC",".BC"),val=c(1,1,2,2),q=c("ABC","AXC","ABC","XBC"),val2=c(10,11,10,12)))
dat3
   pat val   q val2
1: A.C   1 ABC   10
2: A.C   1 AXC   11
3: .BC   2 ABC   10
4: .BC   2 XBC   12

In other words, a left join of dat1 to dat2 for each match of regex pattern pat to query string q
I was wondering if this is possible with a concise data.table merge expression, i.e.

dat1[dat2, .(pat, val, q, val2), on= .(grepl(pat,q))] # this does not work

or any other data.table trick for efficiency. In reality dat1 is hundreds of rows and dat2 can be 10-100 thousand rows.
This sounds simple enough, but I haven't found a post that quite covers it.

The closest I've gotten to is the following:

match_pat_to_q <- function(pattern, data, data.col="q"){
  ret<-lapply(pattern, function(x){
    data[grepl(x,get(data.col))]
  })
  names(ret) <- pattern # becomes an .id column in next step
  rbindlist(ret, idcol=TRUE)
}

match_pat_to_q(dat1$pat, dat2)[dat1, on=.(.id==pat)]

   .id   q val2 val
1: A.C ABC   10   1
2: A.C AXC   11   1
3: .BC ABC   10   2
4: .BC XBC   12   2

Solution

  • You can solve your problem as follows:

    dat1[, dat2[grep(pat, q), .(val, x=q, val2)], by=pat]
    
          pat   val      x  val2
       <char> <num> <char> <num>
    1:    A.C     1    ABC    10
    2:    A.C     1    AXC    11
    3:    .BC     2    ABC    10
    4:    .BC     2    XBC    12
    

    You group dat1 by pat. This allows to logically create one group for each pattern in dat1. Then for each group, you filter the corresponding data from dat2 using pattern matching while selecting the variables needed at the same time.