Search code examples
rfuzzyjoin

Grouping two data frames using stringdist_join


I am currently working on a project and have reached a problem... I am trying to match two data frames based on a candidate's name. I have managed to do this, however with anything more than a max_dist of 2 I start to get duplicate entries. However, these would be easily avoided if I could 'group' the candidates by race (state and district) before running stringdist_join as there are only a few candidates in each race with very little chance of having two candidates with similar names.

The goal is to obtain a table called tmpJoin where I can have both the candidateID and the canVotes, along with the name, state, district.

Any suggestions would be greatly appreciated!

Below is my code as well as a replication of the two datasets

state <- c('AL','AL','AL','AL','AL','NY','NY','NY','NY','NY')  
district <-c('01','02','02','03','01','01','02','01','02','02')  
FullName <-c('Sonny Callahan','Tom Bevill','Faye Baggiano','Thomas
Bevill','Don Sledge','William Turner', 'Bill Turner','Ed Smith','Tom
Bevill','Edward Smith')
canVotes <-c('234','589','9234','729','149','245','879','385','8712','7099')

yearHouseResult <- data.frame(state, district, FullName,canVotes)

state <- c('AL','AL','AL','AL','AL','NY','NY','NY','NY','NY')
district <-c('01','02','02','03','01','01','02','01','02','02')
FullName <-c('Sonny Callahan','Tom Beville','Faye Baggiano','Thom Bevill','Donald Sledge','Bill Turner', 'Bill Turner','Ed Smith','Tom Bevill','Ed Smith')
candidateID <- c('1','2','3','4','5','6','7','8','9','10')

congrCands <- data.frame(state, district, FullName, candidateID)

tmpJoin <- stringdist_join(congrCands, yearHouseResult, 
                           by = "FullName",
                           max_dist=2,
                           method = "osa",
                           ignore_case = FALSE,
                           distance_col = "matchingDistance")

Solution

  • You can test all three conditions with fuzzy_inner_join, also from the fuzzyjoin package.

    First I had to change the factors into numerics and characters, because different factor levels will mess with the function.

    Some information to the fuzzy_join. In argument match_fun is the description of the three conditions and in by the columns for the conditions are specified.

    1. stringdist < 4 for FullName
    2. district must be equal
    3. state must be equal (district is a numeric, state is a character, therefore two different functions are needed to compare these columns)

    The table includes more columns than you need. So you might select the needed columns. I just thought it would be easier to controll the matches this way.

    yearHouseResult <- data.frame(state, district, FullName,canVotes) %>%
      mutate(state = as.character(state),
             district = as.numeric(district),
             FullName = as.character(FullName))
    
    
    congrCands <- data.frame(state, district, FullName, candidateID) %>%
      mutate(state = as.character(state),
             district = as.numeric(district),
             FullName = as.character(FullName))
    
    t <- fuzzy_inner_join(congrCands, yearHouseResult, 
                         match_fun = list(function(x,y) stringdist(x,y, 
                                          method="osa") < 4, 
                                          `==`, 
                                          function(x,y) str_detect(x,y)),
                         by = c( "FullName", "district", "state"))
    

    If you increase the number of stringdist from 4 to 5 you will correctly match Ed/Edward Smith but incorrectly match William/Bill Turner. So you need to decide whats more important a clean match or more matches.