R Match data tables using string matching

I have two data tables:

dt1 <- data.table(V1=c("Apple Pear Orange, AAA111", "Grapes Banana Pear .BBB222", "Orange Kiwi Melon ,CCC333.", "Apple DDD444, Pear Orange", "Kiwi Melon Orange, CCC333", "Apple Pear Orange, AAA111", "Tomato Cucumber-EEE222", "Seagull Pigeon ZZZ111" ), stringsAsFactors = F)

dt2 <- data.table(Code=c("AAA111", "AAA222", "AAA333", "AAA444", "AAA555", "AAA666", "BBB111", "BBB222", "BBB333", "BBB444", "BBB555", "BBB666", "CCC111", "CCC222", "CCC333", "CCC444", "CCC555", "CCC666", "DDD111", "DDD222", "DDD333", "DDD444", "DDD555", "DDD666", "EEE111", "EEE222", "EEE333", "EEE444", "EEE555", "EEE666"), stringsAsFactors = F)
dt2$Ref <- 1:nrow(dt2)

Each row in dt1 contains an unformatted string that includes a 'Code'. dt2 contains a list of codes that can be matched. What I am after is a way for the 'Code' part of the string in each row of dt1 to be identified and then matched with the corresponding code in dt2. If there is no matching code in dt2 then NA is returned.

Here is the type of output I am after:

dt3 <- data.table(V1=c("Apple Pear Orange, AAA111", "Grapes Banana Pear .BBB222", "Orange Kiwi Melon ,CCC333.", "Apple DDD444, Pear Orange", "Kiwi Melon Orange, CCC333", "Apple Pear Orange, AAA111", "Tomato Cucumber-EEE222", "Seagull Pigeon ZZZ111"), Code=c("AAA111", "BBB222", "CCC333", "DDD444", "CCC333", "AAA111", "EEE222", "NA"), Ref=c("1", "8", "15", "22", "15", "1", "26", "NA"), stringsAsFactors = F)

I have tried using regex, grep etc. to find a solution but have not got anywhere.

Solution

You can use regex_left_join from my fuzzyjoin package:

library(fuzzyjoin)
regex_left_join(dt1, dt2, by = c(V1 = "Code"))
#>                            V1   Code Ref
#> 1:  Apple Pear Orange, AAA111 AAA111   1
#> 2: Grapes Banana Pear .BBB222 BBB222   8
#> 3: Orange Kiwi Melon ,CCC333. CCC333  15
#> 4:  Apple DDD444, Pear Orange DDD444  22
#> 5:  Kiwi Melon Orange, CCC333 CCC333  15
#> 6:  Apple Pear Orange, AAA111 AAA111   1
#> 7:     Tomato Cucumber-EEE222 EEE222  26
#> 8:      Seagull Pigeon ZZZ111     NA  NA