Say I have a list of counties with varying amounts of spelling errors or other issues that differentiate them from the 2010 FIPS dataset (code to create fips
dataframe below), but the states in which the misspelled counties reside are entered correctly. Here's a sample
of 21 random observations from my full dataset:
tomatch <- structure(list(county = c("Beauregard", "De Soto", "Dekalb", "Webster",
"Saint Joseph", "West Feliciana", "Ketchikan Gateway", "Evangeline",
"Richmond City", "Saint Mary", "Saint Louis City", "Mclean",
"Union", "Bienville", "Covington City", "Martinsville City",
"Claiborne", "King And Queen", "Mclean", "Mcminn", "Prince Georges"
), state = c("LA", "LA", "GA", "LA", "IN", "LA", "AK", "LA", "VA",
"LA", "MO", "KY", "LA", "LA", "VA", "VA", "LA", "VA", "ND", "TN",
"MD")), .Names = c("county", "state"), class = c("tbl_df", "data.frame"
), row.names = c(NA, -21L))
county state
1 Beauregard LA
2 De Soto LA
3 Dekalb GA
4 Webster LA
5 Saint Joseph IN
6 West Feliciana LA
7 Ketchikan Gateway AK
8 Evangeline LA
9 Richmond City VA
10 Saint Mary LA
11 Saint Louis City MO
12 Mclean KY
13 Union LA
14 Bienville LA
15 Covington City VA
16 Martinsville City VA
17 Claiborne LA
18 King And Queen VA
19 Mclean ND
20 Mcminn TN
21 Prince Georges MD
I've used adist
to create a fuzzy string matching algorithm that matches around 80% of my counties to the county names in fips
. However, sometimes it will match two counties with similar spelling, but from different states (e.g., "Webster, LA" gets matched to "Webster, GA" rather than "Webster Parrish, LA").
distance <- adist(tomatch$county,
fips$countyname,
partial = TRUE)
min.name <- apply(distance, 1, min)
matchedcounties <- NULL
for(i in 1:nrow(distance)) {
s2.i <- match(min.name[i], distance[i, ])
s1.i <- i
matchedcounties <- rbind(data.frame(s2.i = s2.i,
s1.i = s1.i,
s1name = tomatch[s1.i, ]$county,
s2name = fips[s2.i, ]$countyname,
adist = min.name[i]),
matchedcounties)
}
Therefore, I want to restrict fuzzy string matching of county to the correctly spelled versions with matching state.
My current algorithm makes one big matrix which calculates standard Levenshtein distances between both sources and then selects the value with the minimum distance.
To solve my problem, I'm guessing I'd need to create a function that could be applied to each 'state' group by ddply
, but I'm confused as to how I should indicate that the group value in the ddply
function should match another dataframe. A dplyr
solution or solution using any other package would be appreciated as well.
Code to create FIPS dataset:
download.file('http://www2.census.gov/geo/docs/reference/codes/files/national_county.txt',
'./nationalfips.txt')
fips <- read.csv('./nationalfips.txt',
stringsAsFactors = FALSE, colClasses = 'character', header = FALSE)
names(fips) <- c('state', 'statefips', 'countyfips', 'countyname', 'classfips')
# remove 'County' from countyname
fips$countyname <- sub('County', '', fips$countyname, fixed = TRUE)
fips$countyname <- stringr::str_trim(fips$countyname)
Here's a way with dplyr. I first join the tomatch
data.frame with the FIPS names by state (allowing only in-state matches):
require(dplyr)
df <- tomatch %>%
left_join(fips, by="state")
Next, I noticed that a lot of counties don't have 'Saint' but 'St.' in the FIPS dataset. Cleaning that up first should improve the results obtained.
df <- df %>%
mutate(county_clean = gsub("Saint", "St.", county))
Then, group this data.frame by county, and calculate the distance with adist:
df <- df %>%
group_by(county_clean) %>% # Calculate the distance per county
mutate(dist = diag(adist(county_clean, countyname, partial=TRUE))) %>%
arrange(county, dist) # Used this for visual inspection.
Note that I took the diagonal from the resulting matrix as adist returns an n x m matrix with n representing the x vector and m representing the y vector (it calculates all of the combinations). Optionally, you could add the agrep result:
df <- df %>%
rowwise() %>% # 'group_by' a single row.
mutate(agrep_result = agrepl(county_clean, countyname, max.distance = 0.3)) %>%
ungroup() # Always a good idea to remove 'groups' after you're done.
Then filter as you did before, take the minimum distance:
df <- df %>%
group_by(county_clean) %>% # Causes it to calculate the 'min' per group
filter(dist == min(dist)) %>%
ungroup()
Note that this could result in more than one row returned for each of the input rows in tomatch
.
Alternatively, do it all in one run (I usually change code to this format once I'm confident it's doing what it's supposed to do):
df <- tomatch %>%
# Join on all names in the relevant state and clean 'St.'
left_join(fips, by="state") %>%
mutate(county_clean = gsub("Saint", "St.", county)) %>%
# Calculate the distances, per original county name.
group_by(county_clean) %>%
mutate(dist = diag(adist(county_clean, countyname, partial=TRUE))) %>%
# Append the agrepl result
rowwise() %>%
mutate(string_agrep = agrepl(county_clean, countyname, max.distance = 0.3)) %>%
ungroup() %>%
# Only retain minimum distances
group_by(county_clean) %>%
filter(dist == min(dist))
The result in both cases:
county county_clean state countyname dist string_agrep
1 Beauregard Beauregard LA Beauregard Parish 0 TRUE
2 De Soto De Soto LA De Soto Parish 0 TRUE
3 Dekalb Dekalb GA DeKalb 1 TRUE
4 Webster Webster LA Webster Parish 0 TRUE
5 Saint Joseph St. Joseph IN St. Joseph 0 TRUE
6 West Feliciana West Feliciana LA West Feliciana Parish 0 TRUE
7 Ketchikan Gateway Ketchikan Gateway AK Ketchikan Gateway Borough 0 TRUE
8 Evangeline Evangeline LA Evangeline Parish 0 TRUE
9 Richmond City Richmond City VA Richmond city 1 TRUE
10 Saint Mary St. Mary LA St. Mary Parish 0 TRUE
11 Saint Louis City St. Louis City MO St. Louis city 1 TRUE
12 Mclean Mclean KY McLean 1 TRUE
13 Union Union LA Union Parish 0 TRUE
14 Bienville Bienville LA Bienville Parish 0 TRUE
15 Covington City Covington City VA Covington city 1 TRUE
16 Martinsville City Martinsville City VA Martinsville city 1 TRUE
17 Claiborne Claiborne LA Claiborne Parish 0 TRUE
18 King And Queen King And Queen VA King and Queen 1 TRUE
19 Mclean Mclean ND McLean 1 TRUE
20 Mcminn Mcminn TN McMinn 1 TRUE
21 Prince Georges Prince Georges MD Prince George's 1 TRU