
Approximate Matching and Replacement in Text in R


I have a sentence in which I want to replace only part of the string with a number. With an exact match, the gsub function works perfectly:

gsub('great thing', 5555 ,c('hey this is a great thing'))
gsub('good rabbit', 5555 ,c('hey this is a good rabbit in the field'))

But now I have the following problem. How can I apply fuzzy matching if the part of the string I want to replace contains a typo? These calls leave the input unchanged, because the pattern no longer matches exactly:

gsub('great thing', 5555 ,c('hey this is a graet thing'))
gsub('good rabbit', 5555 ,c('hey this is a goood rabit in the field'))

The algorithm should figure out that "great thing" and "graet thing", or "good rabbit" and "goood rabit", are very similar, and replace the approximate match with the number 5555. Ideally it would use the Jaro-Winkler distance to find an approximate match within the string and then replace that substring. I need a general algorithm which can do this.

Any ideas?


Solution

  • Some agrep examples:

    agrep("lasy", "1 lazy 2")
    agrep("lasy", "1 lazy 2", max.distance = list(sub = 0))
    agrep("laysy", c("1 lazy", "1", "1 LAZY"), max.distance = 2)
    agrep("laysy", c("1 lazy", "1", "1 LAZY"), max.distance = 2, value = TRUE)
    agrep("laysy", c("1 lazy", "1", "1 LAZY"), max.distance = 2, ignore.case = TRUE)
    

    agrep is in base R. If you load the stringdist package, you can compute the Jaro-Winkler distance with (you guessed it) stringdist, using method="jw" and a prefix factor p > 0 (p = 0 gives the plain Jaro distance). If you only need to look up matches, you can just use ain or amatch. For my purposes, I tend to use Damerau–Levenshtein (method="dl") more, but your mileage may vary.

    Just make sure to read up on exactly how the algorithm's parameters work before you use it (i.e., set your p, q and maxDist values to levels that make sense for what you're doing).
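
Note that agrep only tells you which elements match; it does not replace anything. As a rough sketch of the replacement step the question asks for, base R's aregexec (the position-returning companion of agrep) can locate the approximate match so it can be cut out. The helper name fuzzy_sub and its default max.distance are assumptions made for illustration, and this uses the edit distance behind agrep, not Jaro-Winkler:

```r
# Hypothetical helper (not part of base R or stringdist): replace the first
# approximate match of `pattern` in each element of `x` with `replacement`.
fuzzy_sub <- function(pattern, replacement, x, max.distance = 2) {
  # aregexec() returns, per element, the start of the match (or -1)
  # with a "match.length" attribute, like regexec().
  hits <- aregexec(pattern, x, max.distance = max.distance, fixed = TRUE)
  vapply(seq_along(x), function(i) {
    start <- hits[[i]][1]
    if (is.na(start) || start < 1) return(x[i])  # no match: leave unchanged
    len <- attr(hits[[i]], "match.length")[1]
    # Rebuild the string: text before the match, replacement, text after it.
    paste0(substr(x[i], 1, start - 1),
           replacement,
           substr(x[i], start + len, nchar(x[i])))
  }, character(1))
}

# The two sentences from the question; max.distance = 2 allows two edits.
fuzzy_sub("great thing", 5555, "hey this is a graet thing")
fuzzy_sub("good rabbit", 5555, "hey this is a goood rabit in the field")
```

The exact span that gets replaced depends on which alignment the matcher reports, so it is worth checking the boundaries on your own data before relying on this. A Jaro-Winkler variant would need a different approach, for instance scoring candidate word windows with stringdist(method="jw") and replacing the best-scoring one.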