Search code examples
rstringrfuzzyjoin

How can I parse out city and country information from a messy and non-uniform partial address column in an R data frame?


I have a very large data that contains a very messy and not uniform address field. I only care to extract a country name out of it. Most of the records contain country and city and some contain other information such as street address, state or province. Because there is so much overlap (streets can be named after countries, cities or states) I was thinking of searching for country and city using R maps library in conjunction with fuzzyjoin. Here is an abbreviated example of what I tried:

library(dplyr)
library(stringr)
library(maps)
library(fuzzyjoin)

partial_address <-
  c("London 121280 Ontario Canada",
    "Milano, Italy",
    "123 First St Columbus OH USA",
    "Cali-Valle del Cauca Colombia",
    "98765 France Paris",
    "Zurich Zurich Switzerland",
    "Mexico City Mexico",
    "Nagoya 123456 Japan",
    "BEIJING BEIJING CHINA",
    "Thailand 12345 Bangkok",
    "Albania La Guahira Colombia",
    "Conakry, Guinea",
    "Guinea-Bissau, Bissau",
    "Democratic Republic of the Congo Kinshasa")

df <-
  data.frame(partial_address) %>%
  mutate(partial_address = toupper(partial_address))

data(world.cities)

world_cities <-
  distinct(world.cities,
           name,
           country.etc) %>%
  mutate(name = toupper(name),
         country.etc = toupper(country.etc))

df_new <-
  fuzzy_left_join(df,
                  world_cities,
                  by = c("partial_address" = "name",
                         "partial_address" = "country.etc"),
                  match_fun = str_detect)

My output looks like this:

> df_new
                             partial_address        name   country.etc
1               LONDON 121280 ONTARIO CANADA      LONDON        CANADA
2                              MILANO, ITALY       MILAN         ITALY
3               123 FIRST ST COLUMBUS OH USA    COLUMBUS           USA
4              CALI-VALLE DEL CAUCA COLOMBIA        CALI      COLOMBIA
5              CALI-VALLE DEL CAUCA COLOMBIA    COLOMBIA      COLOMBIA
6                         98765 FRANCE PARIS       PARIS        FRANCE
7                  ZURICH ZURICH SWITZERLAND      ZURICH   SWITZERLAND
8                         MEXICO CITY MEXICO MEXICO CITY        MEXICO
9                         MEXICO CITY MEXICO        XICO        MEXICO
10                       NAGOYA 123456 JAPAN         AGO         JAPAN
11                       NAGOYA 123456 JAPAN        NAGO         JAPAN
12                       NAGOYA 123456 JAPAN      NAGOYA         JAPAN
13                     BEIJING BEIJING CHINA     BEIJING         CHINA
14                    THAILAND 12345 BANGKOK     BANGKOK      THAILAND
15               ALBANIA LA GUAHIRA COLOMBIA       ALBAN      COLOMBIA
16               ALBANIA LA GUAHIRA COLOMBIA     ALBANIA      COLOMBIA
17               ALBANIA LA GUAHIRA COLOMBIA    COLOMBIA      COLOMBIA
18                           CONAKRY, GUINEA     CONAKRY        GUINEA
19                     GUINEA-BISSAU, BISSAU      BISSAU GUINEA-BISSAU
20 DEMOCRATIC REPUBLIC OF THE CONGO KINSHASA        <NA>          <NA>

As you can see, it is not exactly what I want as apparently Mexico has a city named "Xico" and Colombia has a city named "Colombia", etc. Expected output should look something like this:

> df_new
                             partial_address        name   country.etc
1               LONDON 121280 ONTARIO CANADA      LONDON        CANADA
2                              MILANO, ITALY       MILAN         ITALY
3               123 FIRST ST COLUMBUS OH USA    COLUMBUS           USA
4              CALI-VALLE DEL CAUCA COLOMBIA        CALI      COLOMBIA
5                         98765 FRANCE PARIS       PARIS        FRANCE
6                  ZURICH ZURICH SWITZERLAND      ZURICH   SWITZERLAND
7                         MEXICO CITY MEXICO MEXICO CITY        MEXICO
8                        NAGOYA 123456 JAPAN      NAGOYA         JAPAN
9                      BEIJING BEIJING CHINA     BEIJING         CHINA
10                    THAILAND 12345 BANGKOK     BANGKOK      THAILAND
11               ALBANIA LA GUAHIRA COLOMBIA     ALBANIA      COLOMBIA
12                           CONAKRY, GUINEA     CONAKRY        GUINEA
13                     GUINEA-BISSAU, BISSAU     BISSAU  GUINEA-BISSAU
14 DEMOCRATIC REPUBLIC OF THE CONGO KINSHASA     KINSHASA CONGO DEMOCRATIC REPUBLIC

In addition, fuzzy_join tends to be very slow and use a lot of resources. Any suggestions on how to reduce false positive matches and improve performance are appreciated. Thank you.


Solution

  • One approach could be to use the list of countries in maps and check it against the strings in your data and extract where there is a match.

    # Load required packages.
    
    library(tidyverse)
    library(maps)
    
    # Load and reformat data.
    
    partial_address <-
      c("London 121280 Ontario Canada",
        "Milano, Italy",
        "123 First St Columbus OH USA",
        "Cali-Valle del Cauca Colombia",
        "98765 France Paris",
        "Zurich Zurich Switzerland",
        "Mexico City Mexico",
        "Nagoya 123456 Japan",
        "BEIJING BEIJING CHINA",
        "Thailand 12345 Bangkok")
    
    df <-
      data.frame(partial_address) %>%
      mutate(partial_address = toupper(partial_address))
    
    # Get country data.
    
    world_map_data <- map_data("world")
    
    country_names <- unique(world_map_data$region) %>%
      toupper()
    
    # Function to match country data with partial address.
    
    find_country <- function(partial_address, countries) {
      for (country in countries) {
        if (str_detect(partial_address, fixed(country))) {
          return(country)
        }
      }
      return(NA)
    }
    
    df$country <- map_chr(df$partial_address, ~find_country(., country_names))
    
    

    With the output of df:

    No. Partial Address Country
    1 LONDON 121280 ONTARIO CANADA CANADA
    2 MILANO, ITALY ITALY
    3 123 FIRST ST COLUMBUS OH USA USA
    4 CALI-VALLE DEL CAUCA COLOMBIA COLOMBIA
    5 98765 FRANCE PARIS FRANCE
    6 ZURICH ZURICH SWITZERLAND SWITZERLAND
    7 MEXICO CITY MEXICO MEXICO
    8 NAGOYA 123456 JAPAN JAPAN
    9 BEIJING BEIJING CHINA CHINA
    10 THAILAND 12345 BANGKOK THAILAND