r string-matching stringr levenshtein-distance fuzzy-comparison

fuzzy version of stringr::str_detect for filtering dataframe

I've got a database with free text fields that I want to use to filter a data.frame or tibble. I could perhaps with lots of work create a list of all possible misspellings of my search terms that currently occur in the data (see example of all the spellings I had of one term below) and then I could just use stringr::str_detect as in the example code below. However, this will not be safe when there might be more misspellings in the future. If I'm willing to accept some limitations / make some assumptions (e.g. how far the edit distance between the misspellings could be, or in terms of some other difference, that people won't use completely different terms etc.), is there some simple solution for doing a fuzzy version of str_detect?

As far as I could see the obvious packages like stringdist do not seem to have a function that directly does this. I guess I could write my own function that applies something like stringdist::afind or stringdist::amatch to each element of a vector and post-processes the results to eventually return a vector of TRUE or FALSE booleans, but I wonder whether this function does not exist somewhere (and is more efficiently implemented than I would do it).

Here's an example that illustrates how with str_detect I might miss one row I would want:

library(tidyverse)

search_terms = c("preclinical", "Preclincal", "Preclincial", "Preclinial", 
                 "Precllinical", "Preclilnical", "Preclinica", "Preclnical", 
                 "Peclinical", "Prclinical", "Peeclinical", "Pre clinical", 
                 "Precclinical", "Preclicnial", "Precliical", "Precliinical", 
                 "Preclinal", "Preclincail", "Preclinicgal", "Priclinical")

example_data = tibble(project=c("A111", "A123", "B112", "A224", "C149"),
                      disease_phase=c("Diabetes, Preclinical", "Lipid lowering, Perlcinical", 
                                      "Asthma, Phase I", "Phase II; Hypertension", "Phase 3"),
                      startdate = c("01DEC2018", "17-OKT-2017", "11/15/2019", "1. Dezember 2004", "2005-11-30")) 

# Finds only project A111, but not A123
example_data %>%
  filter(str_detect(tolower(disease_phase), paste0(tolower(search_terms), collapse="|")))

Solution

You can use agrepl for Approximate String Matching (Fuzzy Matching) which is in base.

example_data[agrep(paste(search_terms, collapse = "|"),
  example_data$disease_phase, 2, ignore.case=TRUE, fixed=FALSE),]
#  project               disease_phase   startdate
#1    A111       Diabetes, Preclinical   01DEC2018
#2    A123 Lipid lowering, Perlcinical 17-OKT-2017

Or using Reduce instead of | in the regex.

example_data[Reduce(\(y, x) y | agrepl(x, example_data$disease_phase, 2,
  ignore.case=TRUE), search_terms, FALSE),]
#  project               disease_phase   startdate
#1    A111       Diabetes, Preclinical   01DEC2018
#2    A123 Lipid lowering, Perlcinical 17-OKT-2017

An alternative might be adist, also in base, which calculates a distance matrix - so it might not be recommended for larger vectors, as the matrix can get large. Here I also choose that a mismatch by 2 characters will be OK.

example_data[colSums(adist(unique(search_terms), example_data$disease_phase,
                           partial=TRUE) < 3) > 0,]
#  project               disease_phase   startdate
#1    A111       Diabetes, Preclinical   01DEC2018
#2    A123 Lipid lowering, Perlcinical 17-OKT-2017

In case only single words are compared it might be more efficient so split the disease_phase into words using strsplit also in base.

. <- strsplit(example_data$disease_phase, "[ ,;]+")
. <- split(rep(seq_along(.), lengths(.)), tolower(unlist(.)))
example_data[unique(unlist(.[Reduce(\(y, x) `[<-`(y, !y, agrepl(x, names(.)[!y],
   2)), tolower(search_terms), logical(length(.)))], FALSE, FALSE)),]
#example_data[unique(unlist(.[Reduce(\(y, x) y | agrepl(x, names(.), 2),
#   tolower(search_terms), FALSE)], FALSE, FALSE)),] #Alternative
#  project               disease_phase   startdate
#2    A123 Lipid lowering, Perlcinical 17-OKT-2017
#1    A111       Diabetes, Preclinical   01DEC2018

Some simpler examples using agrep:

#Allow 1 character difference to make match
agrepl("preclinical", c("precinical", "precinicalxyz", "prelcinical"), 1)
#[1]  TRUE  TRUE FALSE

#Allow 2 character difference to make match
agrepl("preclinical", c("precinical", "precinicalxyz", "prelcinical"), 2)
#[1] TRUE TRUE TRUE

#Use boundaries to match words
agrepl("\\bpreclinical\\b", c("xyz precinical xyz", "xyzpreclinicalxyz"), 1, fixed=FALSE)
#[1]  TRUE FALSE

How much difference will be allowed can be set with max.distance:

max.distance: Maximum distance allowed for a match.  Expressed either
          as integer, or as a fraction of the _pattern_ length times
          the maximal transformation cost (will be replaced by the
          smallest integer not less than the corresponding fraction),
          or a list with possible components

          ‘cost’: maximum number/fraction of match cost (generalized
              Levenshtein distance)

          ‘all’: maximal number/fraction of _all_ transformations
              (insertions, deletions and substitutions)

          ‘insertions’: maximum number/fraction of insertions

          ‘deletions’: maximum number/fraction of deletions

          ‘substitutions’: maximum number/fraction of substitutions

And also a Benchmark based on @JBGruber:

system.time({  #Libraries needed for method of JBGruber
library(dplyr);
library(stringdist);
library(Rfast);
library(tidytext)
})
#       User      System verstrichen 
#      1.008       0.040       1.046 

set.seed(42)
example_large <- example_data %>% sample_n(5000, replace = TRUE)

stringdist_detect <- function(a, b, method = "osa", thres = 2) {
  Rfast::rowMins(stringdist::stringdistmatrix(a, b, method = method), value = TRUE) <= thres
}

bench::mark(check = FALSE,
  stringdist_detect = {
     example_large %>% 
      tidytext::unnest_tokens(output = "word", input = "disease_phase", drop = FALSE) %>% 
      filter(stringdist_detect(word, tolower(search_terms), method = "lv"))
  },
  GKi ={. <- strsplit(example_large$disease_phase, "[ ,;]+")
   . <- split(rep(seq_along(.), lengths(.)), tolower(unlist(.)))
   example_large[unique(unlist(.[Reduce(\(y, x) y | agrepl(x, names(.), 2),
     tolower(search_terms), FALSE)], FALSE, FALSE)),]
})
#  expression             min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#  <bch:expr>        <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
#1 stringdist_detect  17.42ms  18.65ms      52.8    7.15MB    19.4     19     7
#2 GKi                 5.64ms   6.04ms     165.   869.08KB     6.27    79     3

Also much time could be saved when there is only one, right written, variant of the words of interest in search_terms.