Search code examples
rsubstringpattern-matchingpartial-matches

R - how do you join tables as a partial match using a column of substrings in one with a column of strings in another?


I have two tables. I want to bring in information from one to the other by matching any values in the second table key column that are contained within the string values .

To demonstrate the task, I'll use a simplified example (my actual data is many thousands of rows and has more columns):

df <- data.frame(key1 = c("apple pie with custard", "chicken and mushroom pie", "key lime pie", "key limecake"))

df2 <- data.frame(key2 = c("key lime", "chicken", "custard"),
                  val = c("dessert", "savoury", "dessert"))

I want to bring in the "val" column from the second data frame where there is a match. All of the key2 values are either substrings or matching strings with the key1 values, and I only want to match where the whole key2 is in or the same as key1. The matches don't have to be unique, I just need to keep all data from the first data frame (left join). The only other complication is the the key2 values should only match where the key1 values contain them without any letters directly before or after them. Using the above example, the desired output would be:

data.frame(key1 = c("apple pie with custard", "chicken and mushroom pie", "key lime pie", "key limecake"),
+                        val = c("dessert", "savoury", "dessert", NA))

                      key1     val
1   apple pie with custard dessert
2 chicken and mushroom pie savoury
3             key lime pie dessert
4             key limecake    <NA>

Note, as specified earlier, "key limecake" does not match with "key lime" because it appears with letters directly after it. It shouldn't match if it has letters directly before it either.

Any ideas?

Thanks in advance.


Solution

  • Using fuzzyjoin (and dplyr, though that can be removed if needed):

    library(dplyr)
    df2 %>%
      mutate(key2 = paste0("\\b", key2, "\\b")) %>%
      fuzzyjoin::regex_left_join(df, ., by = c(key1 = "key2"))
    #                       key1           key2     val
    # 1   apple pie with custard  \\bcustard\\b dessert
    # 2 chicken and mushroom pie  \\bchicken\\b savoury
    # 3             key lime pie \\bkey lime\\b dessert
    # 4             key limecake           <NA>    <NA>
    

    fuzzyjoin::regex_* expects that the RHS has the patterns and the LHS has the strings to be matched, so we need to keep the updated patterns of df2$key2 on the right side of the join.