Is it possible to get R to identify countries in a dataframe?


My dataset currently looks like the sample below. I'd like to add a column with the country names that appear in the 'paragraph' column, but I don't know where to start. Should I load a list of all country names and then use the match function?

Any suggestions for a better approach would be appreciated! Thank you.


The output of dput(head(dataset, 20)) is as follows:

structure(list(category = c("State Ownership and Privatization;...row.names = c(NA, 20L), class = "data.frame")

Solution

  • Use the package "countrycode":

    Toy data:

    df <- data.frame(entry_number = 1:5,
                     text = c("a few paragraphs that might contain the country name congo or democratic republic of congo",
                              "More text that might contain myanmar or burma, as well as thailand",
                              "sentences that do not contain a country name can be returned as NA",
                              "some variant of U.S or the united states",
                              "something with an accent samóoa"))
    

    This is how you can match the country names in a separate column:

    # install.packages("countrycode")
    library(dplyr)
    library(stringr)
    library(countrycode)

    all_country <- countryname_dict %>%
      # keep only names that contain at least one ASCII letter
      # (drops names written entirely in non-Latin scripts):
      filter(grepl('[A-Za-z]', country.name.alt)) %>%
      # extract the column `country.name.alt` as a character vector:
      pull(country.name.alt) %>%
      # change to lower-case:
      tolower()

    # build one alternation pattern from all country names:
    pattern <- str_c(all_country, collapse = '|')  # A huge alternation pattern!
    
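    df %>%
      # extract all country-name matches from the lower-cased text;
      # `country` becomes a list column with one character vector per row
      mutate(country = str_extract_all(tolower(text), pattern))

    Output:
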
      entry_number                                                                                       text
    1            1 a few paragraphs that might contain the country name congo or democratic republic of congo
    2            2                         More text that might contain myanmar or burma, as well as thailand
    3            3                         sentences that do not contain a country name can be returned as NA
    4            4                                                   some variant of U.S or the united states
    5            5                                                            something with an accent samóoa
                                  country
    1 congo, democratic republic of congo
    2             myanma, burma, thailand
    3                                    
    4                       united states
    5                              samóoa
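
Two details in the output above are worth a note: row 2 shows `myanma` rather than `myanmar`, and row 3 is empty rather than NA. The first is what you would expect if a shorter variant such as "myanma" comes before "myanmar" in the pattern, because a backtracking regex engine takes the first alternative that matches at a given position. The following is a minimal base-R sketch of one possible refinement, not part of the original answer. It assumes a tiny made-up dictionary (`country_names`) in place of the full `all_country` vector, and `extract_all` is a hypothetical helper name. It uses `perl = TRUE` so that alternation is ordered, as it is in stringr.

```r
# Hypothetical mini-dictionary standing in for the full `all_country` vector
country_names <- c("myanma", "myanmar", "burma", "thailand", "united states")

texts <- c("More text that might contain myanmar or burma, as well as thailand",
           "sentences that do not contain a country name can be returned as NA")

# Hypothetical helper: all matches of `pattern` in one lower-cased string
extract_all <- function(text, pattern) {
  text <- tolower(text)
  regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
}

# Naive alternation: the first alternative that matches wins,
# so "myanma" shadows "myanmar"
naive <- paste(country_names, collapse = "|")
extract_all(texts[1], naive)
# "myanma" "burma" "thailand"

# Put longer names first and require word boundaries on both sides
by_length <- country_names[order(nchar(country_names), decreasing = TRUE)]
bounded <- paste0("\\b(?:", paste(by_length, collapse = "|"), ")\\b")
extract_all(texts[1], bounded)
# "myanmar" "burma" "thailand"

# Collapse the matches to one string per row, with NA where nothing matched
country <- vapply(texts, function(t) {
  m <- unique(extract_all(t, bounded))
  if (length(m) == 0) NA_character_ else paste(m, collapse = ", ")
}, character(1), USE.NAMES = FALSE)
country
# "myanmar, burma, thailand" NA
```

The same ordering and `\\b` wrapping can be applied to the pattern passed to `str_extract_all()`. Names containing accented characters or regex metacharacters may need extra handling, which this sketch does not attempt.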