I have a very large data that contains a very messy and not uniform address field. I only care to extract a country name out of it. Most of the records contain country and city and some contain other information such as street address, state or province. Because there is so much overlap (streets can be named after countries, cities or states) I was thinking of searching for country and city using R
maps
library in conjunction with fuzzyjoin
. Here is an abbreviated example of what I tried:
library(dplyr)
library(stringr)
library(maps)
library(fuzzyjoin)
partial_address <-
c("London 121280 Ontario Canada",
"Milano, Italy",
"123 First St Columbus OH USA",
"Cali-Valle del Cauca Colombia",
"98765 France Paris",
"Zurich Zurich Switzerland",
"Mexico City Mexico",
"Nagoya 123456 Japan",
"BEIJING BEIJING CHINA",
"Thailand 12345 Bangkok",
"Albania La Guahira Colombia",
"Conakry, Guinea",
"Guinea-Bissau, Bissau",
"Democratic Republic of the Congo Kinshasa")
df <-
data.frame(partial_address) %>%
mutate(partial_address = toupper(partial_address))
data(world.cities)
world_cities <-
distinct(world.cities,
name,
country.etc) %>%
mutate(name = toupper(name),
country.etc = toupper(country.etc))
df_new <-
fuzzy_left_join(df,
world_cities,
by = c("partial_address" = "name",
"partial_address" = "country.etc"),
match_fun = str_detect)
My output looks like this:
> df_new
partial_address name country.etc
1 LONDON 121280 ONTARIO CANADA LONDON CANADA
2 MILANO, ITALY MILAN ITALY
3 123 FIRST ST COLUMBUS OH USA COLUMBUS USA
4 CALI-VALLE DEL CAUCA COLOMBIA CALI COLOMBIA
5 CALI-VALLE DEL CAUCA COLOMBIA COLOMBIA COLOMBIA
6 98765 FRANCE PARIS PARIS FRANCE
7 ZURICH ZURICH SWITZERLAND ZURICH SWITZERLAND
8 MEXICO CITY MEXICO MEXICO CITY MEXICO
9 MEXICO CITY MEXICO XICO MEXICO
10 NAGOYA 123456 JAPAN AGO JAPAN
11 NAGOYA 123456 JAPAN NAGO JAPAN
12 NAGOYA 123456 JAPAN NAGOYA JAPAN
13 BEIJING BEIJING CHINA BEIJING CHINA
14 THAILAND 12345 BANGKOK BANGKOK THAILAND
15 ALBANIA LA GUAHIRA COLOMBIA ALBAN COLOMBIA
16 ALBANIA LA GUAHIRA COLOMBIA ALBANIA COLOMBIA
17 ALBANIA LA GUAHIRA COLOMBIA COLOMBIA COLOMBIA
18 CONAKRY, GUINEA CONAKRY GUINEA
19 GUINEA-BISSAU, BISSAU BISSAU GUINEA-BISSAU
20 DEMOCRATIC REPUBLIC OF THE CONGO KINSHASA <NA> <NA>
As you can see, it is not exactly what I want as apparently Mexico has a city named "Xico" and Colombia has a city named "Colombia", etc. Expected output should look something like this:
> df_new
partial_address name country.etc
1 LONDON 121280 ONTARIO CANADA LONDON CANADA
2 MILANO, ITALY MILAN ITALY
3 123 FIRST ST COLUMBUS OH USA COLUMBUS USA
4 CALI-VALLE DEL CAUCA COLOMBIA CALI COLOMBIA
5 98765 FRANCE PARIS PARIS FRANCE
6 ZURICH ZURICH SWITZERLAND ZURICH SWITZERLAND
7 MEXICO CITY MEXICO MEXICO CITY MEXICO
8 NAGOYA 123456 JAPAN NAGOYA JAPAN
9 BEIJING BEIJING CHINA BEIJING CHINA
10 THAILAND 12345 BANGKOK BANGKOK THAILAND
11 ALBANIA LA GUAHIRA COLOMBIA ALBANIA COLOMBIA
12 CONAKRY, GUINEA CONAKRY GUINEA
13 GUINEA-BISSAU, BISSAU BISSAU GUINEA-BISSAU
14 DEMOCRATIC REPUBLIC OF THE CONGO KINSHASA KINSHASA CONGO DEMOCRATIC REPUBLIC
In addition, fuzzy_join
tends to be very slow and use a lot of resources. Any suggestions on how to reduce false positive matches and improve performance are appreciated. Thank you.
One approach could be to use the list of countries in maps
and check it against the strings in your data and extract where there is a match.
# Load required packages.
library(tidyverse)
library(maps)
# Load and reformat data.
partial_address <-
c("London 121280 Ontario Canada",
"Milano, Italy",
"123 First St Columbus OH USA",
"Cali-Valle del Cauca Colombia",
"98765 France Paris",
"Zurich Zurich Switzerland",
"Mexico City Mexico",
"Nagoya 123456 Japan",
"BEIJING BEIJING CHINA",
"Thailand 12345 Bangkok")
df <-
data.frame(partial_address) %>%
mutate(partial_address = toupper(partial_address))
# Get country data.
world_map_data <- map_data("world")
country_names <- unique(world_map_data$region) %>%
toupper()
# Function to match country data with partial address.
find_country <- function(partial_address, countries) {
for (country in countries) {
if (str_detect(partial_address, fixed(country))) {
return(country)
}
}
return(NA)
}
df$country <- map_chr(df$partial_address, ~find_country(., country_names))
With the output of df
:
No. | Partial Address | Country |
---|---|---|
1 | LONDON 121280 ONTARIO CANADA | CANADA |
2 | MILANO, ITALY | ITALY |
3 | 123 FIRST ST COLUMBUS OH USA | USA |
4 | CALI-VALLE DEL CAUCA COLOMBIA | COLOMBIA |
5 | 98765 FRANCE PARIS | FRANCE |
6 | ZURICH ZURICH SWITZERLAND | SWITZERLAND |
7 | MEXICO CITY MEXICO | MEXICO |
8 | NAGOYA 123456 JAPAN | JAPAN |
9 | BEIJING BEIJING CHINA | CHINA |
10 | THAILAND 12345 BANGKOK | THAILAND |