I have a dataframe of unstructured names, and I want to create a 'master' list with the cleaned name in one column and all of its variants in another. I am using the stringdist package. Below is a small example:
library(dplyr) # for pipes
library(tidyr) # for expand_grid()
library(stringdist)
words <- c("dog", "dot", "don", "con", "cry", "croak", "cat", "dogg", "dogy", "dog", "cat", "dog")
# compare everything to everything
words_df <- expand_grid(raw = words, clean = words) %>%
  mutate(dist = stringdist(raw, clean, method = "jw")) %>%
  # compute word frequency
  group_by(clean) %>%
  mutate(count = n()) %>%
  filter(dist < 0.3)
This results in a df with the distance and word counts for all similar enough combinations:
|raw |clean | dist| count|
|:---|:-----|---------:|-----:|
|dog |dog | 0.0000000| 36|
|dog |dot | 0.2222222| 12|
|dog |don | 0.2222222| 12|
|dog |dogg | 0.0833333| 12|
|dog |dogy | 0.0833333| 12|
|dog |dog | 0.0000000| 36|
|dog |dog | 0.0000000| 36|
|dot |dog | 0.2222222| 36|
|dot |dot | 0.0000000| 12|
|dot |don | 0.2222222| 12|
You can see that the `clean` column contains separate entries for "dog" and "dogg", which I would want to collapse into a single entry ("dog") because the string "dog" appears more often.
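As an aside, `expand_grid` compares every copy of every word, so the table grows with the square of the input length. A sketch (not part of the original question) that builds the same comparison over unique words only, recovering the frequency with a separate count — note that `count` here is the true frequency (3 for "dog") rather than the `n()` of the full cross join (36):

```r
library(dplyr)
library(tidyr)
library(stringdist)

words <- c("dog", "dot", "don", "con", "cry", "croak", "cat",
           "dogg", "dogy", "dog", "cat", "dog")

# frequency of each word in the original (non-deduplicated) vector
freq <- tibble(clean = words) %>%
  count(clean, name = "count")

# compare unique words only: 9 x 9 = 81 pairs instead of 12 x 12 = 144
words_df <- expand_grid(raw = unique(words), clean = unique(words)) %>%
  mutate(dist = stringdist(raw, clean, method = "jw")) %>%
  left_join(freq, by = "clean") %>%
  filter(dist < 0.3)
```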
Here is what I've tried so far:
dict <- words_df %>%
  mutate(clean_new = ifelse(dist < 0.085, words_df[which.max(words_df$count)][[1]][1], clean))
Which results in:
|raw |clean | dist| count|clean_new |
|:---|:-----|---------:|-----:|:---------|
|dog |dog | 0.0000000| 36|NA |
|dog |dot | 0.2222222| 12|dot |
|dog |don | 0.2222222| 12|don |
|dog |con | 0.4444444| 12|con |
|dog |cry | 1.0000000| 12|cry |
|dog |croak | 0.4888889| 12|croak |
|dog |cat | 1.0000000| 24|cat |
|dog |dogg | 0.0833333| 12|NA |
|dog |dogy | 0.0833333| 12|NA |
|dog |dog | 0.0000000| 36|NA |
Essentially, what I want to create is a dictionary mapping every variant of a word to its canonical form, chosen by the frequency of the closest match.
Thanks all!
This `dplyr` pipe statement will return a data frame with 9 rows, one for each of the unique elements in your original `words` vector. First we `group_by` the `raw` column, which creates a group for each unique word, then `filter` by your distance threshold, then find the corresponding word in `clean` with the highest frequency in the original dataset. In your example all words match themselves except for the two variants of "dog."
words_df %>%
  group_by(raw) %>%
  filter(dist < 0.085) %>%
  summarize(clean = clean[which.max(count)])
# A tibble: 9 x 2
raw clean
<chr> <chr>
1 cat cat
2 con con
3 croak croak
4 cry cry
5 dog dog
6 dogg dog
7 dogy dog
8 don don
9 dot dot
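Once you have this dictionary, applying it is just a `left_join` of the original values onto the `raw` column. A self-contained sketch (rebuilding `words_df` from the question's data; `dict` and `cleaned` are names I've chosen for illustration):

```r
library(dplyr)
library(tidyr)
library(stringdist)

words <- c("dog", "dot", "don", "con", "cry", "croak", "cat",
           "dogg", "dogy", "dog", "cat", "dog")

# build the comparison table as in the question
words_df <- expand_grid(raw = words, clean = words) %>%
  mutate(dist = stringdist(raw, clean, method = "jw")) %>%
  group_by(clean) %>%
  mutate(count = n()) %>%
  ungroup()

# the dictionary from the pipe above
dict <- words_df %>%
  group_by(raw) %>%
  filter(dist < 0.085) %>%
  summarize(clean = clean[which.max(count)])

# replace every raw value with its canonical spelling
cleaned <- tibble(raw = words) %>%
  left_join(dict, by = "raw")

# "dogg" and "dogy" now map to "dog"; everything else maps to itself
```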