Search code examples
rstringgroupingdata-wranglingfuzzyjoin

How to group similar spelled character strings together?


I have a table of 10,000 unique names. Using the package(fuzzyjoin) I would like to match these unique names to names that are only spelled with one different letter. I would like to group the different spelling matches so that they can be assessed for misspelling and corrected in the dataset.

An example using baby names is below. I try to match the original name column to a new column, name_match, based on a single letter difference (max_dist = 1).

So for example, Aadam and Adam are matched because they only differ by a single letter, the extra 'a'. I want to group these two together in the dataset so that a unique ID refers to both names since they are the same, but one is misspelled.

baby_data <- 
  data.frame(babynames) %>% 
  select(name) %>% 
  unique() %>% 
  filter(str_starts(name, 'A'))

name_match <- 
  baby_data %>% 
  stringdist_left_join(.,
                       baby_data %>% 
                         rename(name_match = name),
                       max_dist = 1,
                       by = c('name' = 'name_match')) %>% 
  # remove rows where the names are the same
  filter(!(name == name_match))

A few complications arise; first, Aadam and Adam both show up as matches for each other in both columns (i.e., Aadam = Adam, Adam = Aadam). I only need one record to show the match.

Second, how do I group them? Ideally, the final table would look something like:

ID Name
1 Aadam
1 Adam
2 Anne
2 Ann

I feel like there is a case_when() or if_else() problem nested into this?


Solution

  • First, we remove all duplicates from name_match in the sense that e.g. (Adam, Aadam) = (Aadam, Adam). To the remaining entries we assign an id to each tuple and finally unpivot the data.

    library(dplyr)
    
    name_match |>
    
        # remove all "duplicates"
        rowwise() |>
        mutate(key = paste(sort(c(name, name_match)), collapse = "")) |>
        distinct(key, .keep_all = T) |>
        select(-key) |>
        ungroup() |>
    
        # assign an id to each tuple of names
        mutate(id = row_number()) |>
        tidyr::pivot_longer(name:name_match) |>
        select(-name) |>
        rename(name = value)
    

    Note that every name can appear multiple times in the result. Here is one example:

    # A tibble: 4 × 2
         id name 
      <int> <chr>
    1  1958 Adam 
    2  1958 Adham
    3  1965 Adam 
    4  1965 Aadam
    

    Data

    library(fuzzyjoin)
    library(dplyr)
    
    baby_data <- 
      babynames::babynames |> 
      select(name) |> 
      unique() |> 
      filter(stringr::str_starts(name, 'A'))
    
    name_match <- 
      baby_data |> 
      stringdist_left_join(baby_data |> 
                             rename(name_match = name),
                           max_dist = 1,
                           by = c('name' = 'name_match')) |> 
      filter(!(name == name_match))