How to identify all country names mentioned in a string and split accordingly?


I have a string that contains country names and other region names. I am only interested in the country names and would ideally like to add several columns, each containing one country name found in the string. Here is example code showing how the data frame is set up:

df <- data.frame(id = c(1,2,3),
                 country = c("Cote d'Ivoire Africa Developing Economies West Africa",
                              "South Africa United Kingdom Africa BRICS Countries",
                             "Myanmar Gambia Bangladesh Netherlands Africa Asia"))

If I only split the string by space, those countries which contain a space get lost (e.g. "United Kingdom"). See here:

library(tidyr)
df2 <- separate(df, country, paste0("C", 3:8), sep = " ")

Therefore, I tried to look up country names using the world.cities dataset. However, this only seems to work through the string until it reaches a non-country name. See here:

library(maps)
library(stringr)
all_countries <- str_c(unique(world.cities$country.etc), collapse = "|")
df$c1 <- sapply(str_extract_all(df$country, all_countries), toString)

I am wondering whether it's possible to use the space as a delimiter but define exceptions (like "United Kingdom"). This would obviously require some manual work, but it appears to be the most feasible solution to me. Does anyone know how to define such exceptions? I am of course also open to, and thankful for, any other solutions.

UPDATE:

I figured out another solution using the countrycode package:

library(countrycode)
countries <- data.frame(countryname_dict)
countries$continent <- countrycode(sourcevar = countries[["country.name.en"]],
                                   origin = "country.name.en",
                                   destination = "continent")

africa <- countries[ which(countries$continent=='Africa'), ]

library(stringr)
pat <- paste0("\\b", paste(africa$country.name.en , collapse="\\b|\\b"), "\\b")
df$country_list <- str_extract_all(df$country, regex(pat, ignore_case = TRUE))

Solution

  • You could do:

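    library(stringi)
    # Transliterate the lookup names to plain ASCII so they can match unaccented text
    vec <- stri_trans_general(countrycode::codelist$country.name.en, id = "Latin-ASCII")
    # Join the names into one alternation between word boundaries (raw strings need R >= 4.0)
    stri_extract_all(df$country, regex = sprintf(r"(\b(%s)\b)", stri_c(vec, collapse = "|")))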
    [[1]]
    [1] "Cote d'Ivoire"
    
    [[2]]
    [1] "South Africa"   "United Kingdom"
    
    [[3]]
    [1] "Gambia"      "Bangladesh"  "Netherlands"
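
The result above is a list, while the question asks for one column per country. A minimal base R sketch of that last step could look like the following. It is only an illustration under stated assumptions: `country_names` is a small hand-picked lookup vector (not the countrycode data), and the names `pattern`, `padded`, `df_wide` and the `country_1`, `country_2`, ... column names are made up for this example.

```r
# Same example data as in the question.
df <- data.frame(id = c(1, 2, 3),
                 country = c("Cote d'Ivoire Africa Developing Economies West Africa",
                             "South Africa United Kingdom Africa BRICS Countries",
                             "Myanmar Gambia Bangladesh Netherlands Africa Asia"),
                 stringsAsFactors = FALSE)

# Assumption: a small hand-picked lookup list, used here only for illustration.
country_names <- c("Cote d'Ivoire", "South Africa", "United Kingdom",
                   "Myanmar", "Gambia", "Bangladesh", "Netherlands")

# Longest names first, so a longer name is tried before a shorter one
# that starts at the same position.
country_names <- country_names[order(nchar(country_names), decreasing = TRUE)]
pattern <- paste0("\\b(", paste(country_names, collapse = "|"), ")\\b")

# One character vector of matches per row.
matches <- regmatches(df$country, gregexpr(pattern, df$country, perl = TRUE))

# Pad every vector with NA up to the longest one, then bind the rows together.
n_max <- max(lengths(matches))
padded <- do.call(rbind, lapply(matches, function(m) {
  c(m, rep(NA_character_, n_max - length(m)))
}))
colnames(padded) <- paste0("country_", seq_len(n_max))

df_wide <- cbind(df, padded, stringsAsFactors = FALSE)
```

Rows with fewer matches get `NA` in the trailing columns. If the lookup names come from a real dataset instead, any that contain regex metacharacters (parentheses, dots and so on) would need escaping before being pasted into the pattern, otherwise they may silently fail to match.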