r tidyr stringr fuzzy-comparison

Is there an R package (or existing function) for fuzzy string detection?


I'm looking for something similar to str_detect() from the stringr package, but which is capable of detecting imperfect or "fuzzy" matches. Preferably, I'd like to be able to specify the degree of imperfection (1 different character, 2 different characters, etc.).

The matching I'm doing will take a form similar to the below code (but this is just a simplified example I made up). In the example, only "RUTH CHRIS" gets matched - I'd like something capable of matching the slightly wrong strings as well.

library(tidyverse)

my_restaurants <- tibble(restaurant = c("MCDOlNALD'S ON FRANKLIN ST",
                                        "NEW JERSEY WENDYS",
                                        "8/25/19 RUTH CHRIS",
                                        "MELTINGPO 9823i3")
)

cheap <- c("MCDONALD'S", "WENDY'S") %>% str_c(collapse="|")
expensive <- c("RUTH CHRIS", "MELTING POT") %>% str_c(collapse="|")

my_restaurants %>%
  mutate(category = case_when(
    str_detect(restaurant, cheap) ~ "CHEAP",
    str_detect(restaurant, expensive) ~ "EXPENSIVE"
    )) 

So again, this gives this output:

##  A tibble: 4 × 2
#   restaurant                 category 
#   <chr>                      <chr>    
# 1 MCDOlNALD'S ON FRANKLIN ST NA       
# 2 NEW JERSEY WENDYS          NA       
# 3 8/25/19 RUTH CHRIS         EXPENSIVE
# 4 MELTINGPO 9823i3           NA 

But I want:

## A tibble: 4 × 2
#   restaurant                 category 
#   <chr>                      <chr>    
# 1 MCDOlNALD'S ON FRANKLIN ST CHEAP       
# 2 NEW JERSEY WENDYS          CHEAP       
# 3 8/25/19 RUTH CHRIS         EXPENSIVE
# 4 MELTINGPO 9823i3           EXPENSIVE 

I'm not against using regex, but my actual data is significantly more complicated than the given example, so I'd prefer something much more concise that allows for general, not specific, types of fuzziness.


Solution

  • The top response to this question clued me in to try agrepl(), which seems to best suit my needs for this project since it is a straightforward substitute for str_detect().

    Using my example from above...

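    library(tidyverse)
    
    my_restaurants <- tibble(restaurant = c("MCDOlNALD'S ON FRANKLIN ST",
                                            "NEW JERSEY WENDYS",
                                            "8/25/19 RUTH CHRIS",
                                            "MELTINGPO 9823i3")
    )
    
    # Join the names with "|" so each set can be used as one regex alternation
    cheap <- c("MCDONALD'S", "WENDY'S") %>% str_c(collapse="|")
    expensive <- c("RUTH CHRIS", "MELTING POT") %>% str_c(collapse="|")
    
    # agrepl() takes the pattern first, then the strings to search.
    # max.distance = 2 allows up to two edits; fixed = FALSE treats the
    # pattern as a regular expression, so "|" means "or".
    my_restaurants %>%
      mutate(category = case_when(
        agrepl(cheap, restaurant, max.distance = 2, fixed = FALSE) ~ "CHEAP",
        agrepl(expensive, restaurant, max.distance = 2, fixed = FALSE) ~ "EXPENSIVE"
      ))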
    

    Gives the output:

    # A tibble: 4 × 2
      restaurant                 category 
      <chr>                      <chr>    
    1 MCDOlNALD'S ON FRANKLIN ST CHEAP    
    2 NEW JERSEY WENDYS          CHEAP    
    3 8/25/19 RUTH CHRIS         EXPENSIVE
    4 MELTINGPO 9823i3           EXPENSIVE
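
    The effect of the distance argument can be checked directly. Below is a minimal sketch using only base R, with one literal pattern at a time (agrepl()'s default fixed = TRUE) so that the regex alternation does not get in the way. The variable names (x, exact, one_edit, pot_one, pot_two) are just for illustration.

    ```r
    x <- c("MCDOlNALD'S ON FRANKLIN ST",
           "NEW JERSEY WENDYS",
           "8/25/19 RUTH CHRIS",
           "MELTINGPO 9823i3")

    # max.distance = 0 allows no edits, i.e. an exact substring match
    exact <- agrepl("MCDONALD'S", x, max.distance = 0)

    # One edit is enough for row 1: the stray "l" is a single insertion
    one_edit <- agrepl("MCDONALD'S", x, max.distance = 1)

    # "MELTINGPO" is two edits away from "MELTING POT"
    # (the space and the final "T" are both missing)
    pot_one <- agrepl("MELTING POT", x, max.distance = 1)
    pot_two <- agrepl("MELTING POT", x, max.distance = 2)

    # One row per call, one column per restaurant string
    print(rbind(exact, one_edit, pot_one, pot_two))
    ```

    Note that, according to ?agrep, a max.distance below 1 is interpreted as a fraction of the pattern length rather than as a count of edits, so whole numbers are the safer choice when the goal is "at most n different characters".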
    

    However, onyambu's solutions also seem to be good alternative methods. They allow for more advanced forms of fuzzy matching than agrepl() is capable of.
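
    As one illustration of going beyond agrepl() (this is not onyambu's code, which is not reproduced here), base R's adist() returns the actual edit distances, so several literal patterns can be compared without building a regex. The sketch below assumes a hypothetical helper, fuzzy_detect(), that I made up for this example; partial = TRUE asks adist() for the distance to the best-matching substring, which is the kind of distance agrep() uses by default.

    ```r
    # Hypothetical helper (not part of any package): TRUE where at least one
    # pattern occurs in x within max_dist edits.
    fuzzy_detect <- function(x, patterns, max_dist = 1) {
      # adist() returns a matrix: one row per pattern, one column per string
      d <- adist(patterns, x, partial = TRUE)
      # For each string, keep the distance to its closest pattern
      apply(d, 2, min) <= max_dist
    }

    restaurants <- c("MCDOlNALD'S ON FRANKLIN ST",
                     "NEW JERSEY WENDYS",
                     "8/25/19 RUTH CHRIS",
                     "MELTINGPO 9823i3")

    cheap     <- c("MCDONALD'S", "WENDY'S")
    expensive <- c("RUTH CHRIS", "MELTING POT")

    is_cheap     <- fuzzy_detect(restaurants, cheap, max_dist = 1)
    is_expensive <- fuzzy_detect(restaurants, expensive, max_dist = 2)

    print(is_cheap)      # TRUE TRUE FALSE FALSE
    print(is_expensive)  # FALSE FALSE TRUE TRUE
    ```

    Since the helper returns a logical vector, it should be usable inside case_when() in the same position as str_detect() or agrepl(), and the threshold can differ per category.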