
Replace string with most frequent fuzzy match


I have a data frame of unstructured names, and I want to create a 'master' list with the cleaned name in one column and all of its variants in another. I am using the stringdist package. Below is a small example:

library(dplyr) # for pipes 
library(tidyr) # for expand_grid()
library(stringdist) 

words <- c("dog","dot","don","con","cry","croak","cat","dogg", "dogy", "dog", "cat", "dog")
# compare everything to everything 
words_df <- expand_grid(raw = words, clean = words) %>%
    # pairwise Jaro-Winkler distance for every raw/clean combination
    mutate(dist = stringdist(raw, clean, method = "jw")) %>%
    # compute word frequency 
    group_by(clean) %>%
    mutate(count = n()) %>%
    # keep only similar-enough pairs
    filter(dist < 0.3)

This results in a data frame with the distance and word count for every similar-enough combination:

|raw |clean |      dist| count|
|:---|:-----|---------:|-----:|
|dog |dog   | 0.0000000|    36|
|dog |dot   | 0.2222222|    12|
|dog |don   | 0.2222222|    12|
|dog |dogg  | 0.0833333|    12|
|dog |dogy  | 0.0833333|    12|
|dog |dog   | 0.0000000|    36|
|dog |dog   | 0.0000000|    36|
|dot |dog   | 0.2222222|    36|
|dot |dot   | 0.0000000|    12|
|dot |don   | 0.2222222|    12|

You can see that in the clean column I have entries for "dog" as well as its variants "dogg" and "dogy", which I would want to collapse into a single entry (dog) because the string "dog" appears most often.

Here is what I've tried so far:

dict <- words_df %>%
    mutate(clean_new = ifelse(dist < 0.085, words_df[which.max(words_df$count)][[1]][1], clean))    

This results in:

|raw |clean |      dist| count|clean_new |
|:---|:-----|---------:|-----:|:---------|
|dog |dog   | 0.0000000|    36|NA        |
|dog |dot   | 0.2222222|    12|dot       |
|dog |don   | 0.2222222|    12|don       |
|dog |con   | 0.4444444|    12|con       |
|dog |cry   | 1.0000000|    12|cry       |
|dog |croak | 0.4888889|    12|croak     |
|dog |cat   | 1.0000000|    24|cat       |
|dog |dogg  | 0.0833333|    12|NA        |
|dog |dogy  | 0.0833333|    12|NA        |
|dog |dog   | 0.0000000|    36|NA        |

Essentially, what I want to create is a dictionary mapping each variant to its closest match, where the match is chosen by the frequency of the matching word.

Thanks all!


Solution

  • This dplyr pipe statement returns a data frame with 9 rows, one for each unique element in your original words vector. First we group_by the raw column, which creates a group for each unique word; then we filter by your distance threshold; finally we take the corresponding word in clean with the highest frequency in the original dataset. In your example every word matches itself except for the two variants of "dog."

    Code

    words_df %>%
      group_by(raw) %>%
      filter(dist < 0.085) %>%
      summarize(clean = clean[which.max(count)])
    

    Output

    # A tibble: 9 x 2
      raw   clean
      <chr> <chr>
    1 cat   cat  
    2 con   con  
    3 croak croak
    4 cry   cry  
    5 dog   dog  
    6 dogg  dog  
    7 dogy  dog  
    8 don   don  
    9 dot   dot
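
    To apply the dictionary back to the data, a left_join() works well.
    A minimal sketch, assuming the summarized table above is stored as
    dict (the dict name is mine, not part of the original code):

    # store the lookup table from the pipe above
    dict <- words_df %>%
      group_by(raw) %>%
      filter(dist < 0.085) %>%
      summarize(clean = clean[which.max(count)])

    # left_join() keeps the original order of `words` and attaches the
    # cleaned spelling for each variant
    tibble(raw = words) %>%
      left_join(dict, by = "raw")

    One caveat: which.max() returns the first maximum, so ties between
    equally frequent matches are broken by row order in words_df.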