I'm having tough time figuring out the amatch function in R.
2 data frames, respondent (with user input) and census. I have already normalized and merged user input with perfect matches from the census. I would now like to do a "closest match" to narrow down the remaining (as of yet unmatched) respondent entries. Do I nest amatch inside of the merge function? I've read the amatch documentation and done an extensive search, but haven't been able to make it work.
I can't even get a TRUE / FALSE output, trying to follow the documentation:
library(stringdist)
closest <- amatch(respondent, census, nomatch = 0, match NA = TRUE)
> dput(respondent)
structure(list(respondent = c(196L, 197L, 204L, 209L, 214L, 215L,
220L, 225L, 226L, 228L, 229L), resp_input = structure(c(1L, 10L,
7L, 8L, 2L, 3L, 9L, 6L, 4L, 5L, 11L), .Label = c("Bramppton",
"Brookfield", "Caledonn", "Cartwright", "Durham Region", "Halifax / ",
"Kinuso", "Kleowna", "St. Paul/Saddle Lake", "Tornto", "watsonlake"
), class = "factor")), .Names = c("respondent", "resp_input"), class = "data.frame", row.names = c(NA,
-11L))
> dput(census)
structure(list(code = c(3520005L, 3521005L, 3521010L, 3521024L,
5935010L, 5953012L, 5953019L, 5953023L, 5953033L, 6001003L, 6001004L,
6001006L), census_title = structure(c(11L, 8L, 1L, 2L, 5L, 7L,
4L, 9L, 6L, 12L, 3L, 10L), .Label = c("Brampton", "Caledon",
"Faro", "Fraser-Fort George H", "Kelowna", "Mackenzie", "McBride",
"Mississauga", "Prince George", "Teslin", "Toronto", "Watson Lake"
), class = "factor")), .Names = c("code", "census_title"), class = "data.frame", row.names = c(NA,
-12L))
In addition to Ronak's answer, check out the fuzzyjoin
package which simplifies the process of joining on a fuzzy string match:
fuzzyjoin::stringdist_join(
x = respondent,
y = census,
by = c("resp_input" = "census_title"),
method = "osa",
ignore_case = TRUE,
max_dist = 3,
mode = "inner"
)
#> respondent resp_input code census_title
#> 1 196 Bramppton 3521010 Brampton
#> 2 197 Tornto 3520005 Toronto
#> 3 209 Kleowna 5935010 Kelowna
#> 4 215 Caledonn 3521024 Caledon
#> 5 229 watsonlake 6001003 Watson Lake
You can play around with the various methods and max_dist
parameter to get the best result for your data.