Search code examples
rregexsql-likesqldf

Merge two tables on LIKE but for whole string not parts of strings


This is my first post/question so be kind. I have a dataframe like this:

        id                             product
1 00109290                    Wax Salt; Pepper
2 23243242                          Wood Stuff
3 23242433   Magic Unicorn Powder and My Tears
4 23778899                             gelatin
5 25887766                                tin;
6  7786655             fart noises, and things
7  3432422 --spearmint bacon& hydrangia leaves

I have a lookup table like this:

        ingredients
1               wax
2              salt
3              wood
4          my tears
5    unicorn powder
6           gelatin
7               tin
8  hydrangia leaves
9         spearmint
10            bacon

I want to merge them on whole strings so I get this:

     id                             product      ingredients
1  00109290                    Wax Salt; Pepper              wax
2  00109290                    Wax Salt; Pepper             salt
3  23243242                          Wood Stuff             wood
4  23242433   Magic Unicorn Powder and My Tears         my tears
5  23242433   Magic Unicorn Powder and My Tears   unicorn powder
6  23778899                             gelatin          gelatin
7  25887766                                tin;              tin
8   3432422 --spearmint bacon& hydrangia leaves hydrangia leaves
9   3432422 --spearmint bacon& hydrangia leaves        spearmint
10  3432422 --spearmint bacon& hydrangia leaves            bacon

Instead I get this (notice line 7 not wanted):

         id                             product      ingredients
1  00109290                    Wax Salt; Pepper              wax
2  00109290                    Wax Salt; Pepper             salt
3  23243242                          Wood Stuff             wood
4  23242433   Magic Unicorn Powder and My Tears         my tears
5  23242433   Magic Unicorn Powder and My Tears   unicorn powder
6  23778899                             gelatin          gelatin
7  23778899                             gelatin              tin
8  25887766                                tin;              tin
9   3432422 --spearmint bacon& hydrangia leaves hydrangia leaves
10  3432422 --spearmint bacon& hydrangia leaves        spearmint
11  3432422 --spearmint bacon& hydrangia leaves            bacon

I am so painfully close, but I am matching incorrectly to 'gelatin' with 'tin'. I want to match whole words, not parts of words. I've tried many different techniques, the closest that gets met there is this:

library(sqldf)
id <- c('00109290', '23243242', '23242433', 
        '23778899', '25887766', '7786655', 
        '3432422')
product <- c('Wax Salt; Pepper', 'Wood Stuff', 
             'Magic Unicorn Powder and My Tears', 
             'gelatin', 'tin;', 'fart noises, and things', 
             '--spearmint bacon& hydrangia leaves')

ingredients <- c('wax', 'salt', 'wood', 'my tears', 
                 'unicorn powder', 'gelatin', 'tin', 
                 'hydrangia leaves', 
                 'spearmint', 'bacon') 

products <- data.frame(id, product)
ingred <- data.frame(ingredients)    
new_df <- sqldf("SELECT * from products 
                 join ingred on product LIKE '%' || ingredients || '%'")

Truly appreciate any advice. Perhaps a completely different approach is needed? I also welcome advice on the quality of the question, it's my first so you'd better set me straight right away.


Solution

  • A solution using the fuzzyjoin package, and str_detect from stringr:

    library(fuzzyjoin)
    library(stringr)
    
    f <- function(x, y) {
      # tests whether y is an ingredient of x
      str_detect(x, regex(paste0("\\b", y, "\\b"), ignore_case = TRUE))
    }
    
    fuzzy_join(products, 
               ingred, 
               by = c("product" = "ingredients"), 
               match_fun = f)
    #         id                           product    ingredients
    # 1   109290                  Wax Salt; Pepper            wax
    # 2   109290                  Wax Salt; Pepper           salt
    # 3 23243242                        Wood Stuff           wood
    # 4 23242433 Magic Unicorn Powder and My Tears       my tears
    # 5 23242433 Magic Unicorn Powder and My Tears unicorn powder
    # 6 23778899                           gelatin        gelatin
    

    Data

    products <- read.table(text = "
            id                             product
    1 00109290                  'Wax Salt; Pepper'
    2 23243242                        'Wood Stuff'
    3 23242433 'Magic Unicorn Powder and My Tears'
    4 23778899                             gelatin                  
      ", stringsAsFactors = FALSE)
    
    ingred <- read.table(text = "
           ingredients
    1              wax
    2             salt
    3             wood
    4       'my tears'
    5 'unicorn powder'
    6          gelatin
    7              tin
      ", stringsAsFactors = FALSE)