I have a table of 10,000 unique names. Using the package(fuzzyjoin)
I would like to match these unique names to names that are only spelled with one different letter. I would like to group the different spelling matches so that they can be assessed for misspelling and corrected in the dataset.
An example using baby names is below. I try to match the original name column to a new column, name_match
, based on a single letter difference (max_dist = 1)
.
So for example, Aadam and Adam are matched because they only differ by a single letter, the extra 'a'. I want to group these two together in the dataset so that a unique ID refers to both names since they are the same, but one is misspelled.
baby_data <-
data.frame(babynames) %>%
select(name) %>%
unique() %>%
filter(str_starts(name, 'A'))
name_match <-
baby_data %>%
stringdist_left_join(.,
baby_data %>%
rename(name_match = name),
max_dist = 1,
by = c('name' = 'name_match')) %>%
# remove rows where the names are the same
filter(!(name == name_match))
A few complications arise; first, Aadam and Adam both show up as matches for each other in both columns (i.e., Aadam = Adam, Adam = Aadam). I only need one record to show the match.
Second, how do I group them? Ideally, the final table would look something like:
ID | Name |
---|---|
1 | Aadam |
1 | Adam |
2 | Anne |
2 | Ann |
I feel like there is a case_when()
or if_else()
problem nested into this?
First, we remove all duplicates from name_match
in the sense that e.g. (Adam, Aadam) = (Aadam, Adam)
. To the remaining entries we assign an id
to each tuple and finally unpivot the data.
library(dplyr)
name_match |>
# remove all "duplicates"
rowwise() |>
mutate(key = paste(sort(c(name, name_match)), collapse = "")) |>
distinct(key, .keep_all = T) |>
select(-key) |>
ungroup() |>
# assign an id to each tuple of names
mutate(id = row_number()) |>
tidyr::pivot_longer(name:name_match) |>
select(-name) |>
rename(name = value)
Note that every name can appear multiple times in the result. Here is one example:
# A tibble: 4 × 2
id name
<int> <chr>
1 1958 Adam
2 1958 Adham
3 1965 Adam
4 1965 Aadam
library(fuzzyjoin)
library(dplyr)
baby_data <-
babynames::babynames |>
select(name) |>
unique() |>
filter(stringr::str_starts(name, 'A'))
name_match <-
baby_data |>
stringdist_left_join(baby_data |>
rename(name_match = name),
max_dist = 1,
by = c('name' = 'name_match')) |>
filter(!(name == name_match))