Search code examples
rdata-cleaning

R cleaning data - amatch to merge close matches from 2 separate data frames


I'm having tough time figuring out the amatch function in R.

2 data frames, respondent (with user input) and census. I have already normalized and merged user input with perfect matches from the census. I would now like to do a "closest match" to narrow down the remaining (as of yet unmatched) respondent entries. Do I nest amatch inside of the merge function? I've read the amatch documentation and done an extensive search, but haven't been able to make it work.

I can't even get a TRUE / FALSE output, trying to follow the documentation:

library(stringdist)
closest <- amatch(respondent, census, nomatch = 0, match NA = TRUE)

respondent census desired

> dput(respondent)
structure(list(respondent = c(196L, 197L, 204L, 209L, 214L, 215L, 
220L, 225L, 226L, 228L, 229L), resp_input = structure(c(1L, 10L, 
7L, 8L, 2L, 3L, 9L, 6L, 4L, 5L, 11L), .Label = c("Bramppton", 
"Brookfield", "Caledonn", "Cartwright", "Durham Region", "Halifax / ", 
"Kinuso", "Kleowna", "St. Paul/Saddle Lake", "Tornto", "watsonlake"
), class = "factor")), .Names = c("respondent", "resp_input"), class = "data.frame", row.names = c(NA, 
-11L))
> dput(census)
structure(list(code = c(3520005L, 3521005L, 3521010L, 3521024L, 
5935010L, 5953012L, 5953019L, 5953023L, 5953033L, 6001003L, 6001004L, 
6001006L), census_title = structure(c(11L, 8L, 1L, 2L, 5L, 7L, 
4L, 9L, 6L, 12L, 3L, 10L), .Label = c("Brampton", "Caledon", 
"Faro", "Fraser-Fort George H", "Kelowna", "Mackenzie", "McBride", 
"Mississauga", "Prince George", "Teslin", "Toronto", "Watson Lake"
), class = "factor")), .Names = c("code", "census_title"), class = "data.frame", row.names = c(NA, 
-12L))

Solution

  • In addition to Ronak's answer, check out the fuzzyjoin package which simplifies the process of joining on a fuzzy string match:

    fuzzyjoin::stringdist_join(
      x = respondent, 
      y = census, 
      by = c("resp_input" = "census_title"),
      method = "osa",
      ignore_case = TRUE,
      max_dist = 3,
      mode = "inner"
      )
    
    #>   respondent resp_input    code census_title
    #> 1        196  Bramppton 3521010     Brampton
    #> 2        197     Tornto 3520005      Toronto
    #> 3        209    Kleowna 5935010      Kelowna
    #> 4        215   Caledonn 3521024      Caledon
    #> 5        229 watsonlake 6001003  Watson Lake
    

    You can play around with the various methods and max_dist parameter to get the best result for your data.