Search code examples
rstringmatchingfuzzy

Fuzzy matching two long character vectors in R


I have two vectors: Candidates$names containing roughly 45.000 names of electoral candidates and Incumbents$names containing roughly 7600 names of members of parliament. I want to check for each of the names in Candidates whether it exists in Incumbents and create a new dummy variable incumbent in Candidates that takes the value 1 if that is the case, 0 if not.

My problem is that the names between the two lists can vary ever so slightly. Sometimes there are titles included in the name, sometimes middle names etc. So direct matching would not work reliably, but I need an approach that allows for some fuzziness.

I tried expand.grid(Candidates$names, Incumbents$names) in combination with adist() as an indicator of proximity and then setting arbitrary percentages (based on distance/length of name) as a cut-off point, but the lengths of the resulting table crashes R on my computer and the method does not appear practical or sufficiently reliable.

Is there a better way to perform the required fuzzy matching?

EDIT: Here some example vectors.

Candidates <- data.frame(name = c("Barack Obama", "George W. Bush", "Jimmy Carter", "Tony Blair", "Mickey Mouse", "Darth Vader"), incumbent = NA)
Incumbents <- data.frame(name = c("Anakin Skywalker", "Sir Tony Blair", "Barack Hussein Obama", "James Carter"))

The resulting data frame should look something like this:

Candidates <- data.frame(name = c("Barack Obama", "George W. Bush", "Jimmy Carter", "Tony Blair", "Mickey Mouse", "Darth Vader"), incumbent = c(1, 0, 1, 1, 0, 0)

EDIT #2: The response by phiver was very helpful, yet I have run into the problem that some names are in my dataset more than once. To identify them uniquely, I would like to use an additional variable Candidates$party and Incumbents$party in the matching process. How would I include this precise additional matching variable in the code?

To modify my example:

Candidates <- data.frame(name = c("Barack Obama", "George W. Bush", "Jimmy Carter", "Tony Blair", "Mickey Mouse", "Darth Vader", "John Smith", "John Smith"), party = c("Democrat", "Republican", "Democrat", "Democrat", "Republican", "Democrat", "Democrat", "Republican") , incumbent = NA)
Incumbents <- data.frame(name = c("Anakin Skywalker", "Sir Tony Blair", "Barack Hussein Obama", "James Carter", "John Smith"), party = ("Republican", "Democrat", "Democrat", "Democrat", "Republican")

The output should be the same as before with the addition of the Republican John Smith as an incumbent.

Candidates <- data.frame(name = c("Barack Obama", "George W. Bush", "Jimmy Carter", "Tony Blair", "Mickey Mouse", "Darth Vader", "John Smith", "John Smith"), party = c("Democrat", "Republican", "Democrat", "Democrat", "Republican", "Democrat", "Democrat", "Republican") , incumbent = c(1, 0, 1, 1, 0, 0, 0, 1)

Solution

  • using the fuzzyjoin package you can match the names in a join.

    With the example given this code will reproduce the expected example output.

    library(fuzzyjoin)
    library(dplyr)
    
    Candidates %>% 
      stringdist_left_join(Incumbents, by = c("name" = "name"), method = "jw", max_dist = 0.2) %>% 
      rename(candidate_name = name.x) %>% 
      mutate(incumbent = if_else(!is.na(name.y), 1, 0)) %>% 
      select(-name.y)
    
      candidate_name incumbent
    1   Barack Obama         1
    2 George W. Bush         0
    3   Jimmy Carter         1
    4     Tony Blair         1
    5   Mickey Mouse         0
    6    Darth Vader         0
    

    Now the reason for jw is that this has been developed for matching names where there are only a few mistakes. The closer to 0 the more correct the name, with 0 being the exactly the same name. Choosing the correct max_dist is a bit of finetuning. Cleaning up some names might help.

    To get get rid of titles like "Sir" etc. you could use regex on the incubent names directly with gsub or use stringr::str_remove before using the join.

    edit to reflect addition:

    You can extend the join to join on multiple columns.

    Candidates %>% 
      stringdist_left_join(Incumbents, by = c("name" = "name", "party" = "party"), 
                           method = "jw",
                           max_dist = 0.2) %>%
      rename(candidate_name = name.x,
             candidate_party = party.x) %>% 
      mutate(incumbent = if_else(!is.na(name.y), 1, 0)) %>% 
      select(-ends_with(".y")) # remove not needed columns coming from incubent table
    
      candidate_name candidate_party incumbent
    1   Barack Obama        Democrat         1
    2 George W. Bush      Republican         0
    3   Jimmy Carter        Democrat         1
    4     Tony Blair        Democrat         1
    5   Mickey Mouse      Republican         0
    6    Darth Vader        Democrat         0
    7     John Smith        Democrat         0
    8     John Smith      Republican         1