How to create a variable in R indicating whether a city is a capital or not?


I have a dataset of events that took place around the world. I intend to aggregate it to the country-year level, but before doing that I want to create a variable, "capital.city", indicating whether each event took place in a capital city.

What I've tried so far, after consulting Bing's AI, is this:

library(countrycode)
library(maps)

# Load the world cities dataset
data("world.cities")

# Create a list of capital cities
capital_cities <- unique(world.cities$capital)

# Create a new variable indicating whether a city is a capital or not
dt_protest$capital_city <- ifelse(dt_protest$city %in% capital_cities, "capital", "non-capital")

But this doesn't really work: I only get non-capital values. What am I doing wrong?

Here's a sample of my data:

date    month   year    city    country
4/4/2006    4   2006    Lyon    France
5/23/2021   5   2021    Abeokuta    Nigeria
3/19/1996   3   1996    Kuala Lumpur    Malaysia
11/30/2006  11  2006    Moscow  Russia
11/30/2011  11  2011    Tinsukia    India
1/4/2014    1   2014    Saharsa    India
11/23/2016  11  2016    Venezuela   Cuba
9/27/2019   9   2019    Shanghai    China
5/22/2003   5   2003    Bonn    Germany
12/7/2006   12  2006    Thetford    United Kingdom
9/10/2010   9   2010    New Delhi   India
11/17/2020  11  2020    Helsinki    Finland
1/22/2011   1   2011    Berlin  Germany
3/19/1993   3   1993    Jerusalem   Israel
8/2/2004    8   2004    Mumbai  India
12/9/2000   12  2000    Mumbai  India
8/29/2001   8   2001    Guelph  Canada
4/7/2003    4   2003    Seoul   South Korea
9/11/2003   9   2003    Brussels    Belgium
4/5/2006    4   2006    Hong Kong   China
2/1/2007    2   2007    Kathmandu   Nepal
10/4/2007   10  2007    Moscow  Russia
9/3/2008    9   2008    Luanda  Angola
10/21/2009  10  2009    Johannesburg    South Africa
2/20/2010   2   2010    Tashkent    Uzbekistan
7/20/2010   7   2010    Singur  India
10/24/2011  10  2011    Srinagar    India
11/14/2012  11  2012    Delhi   India
1/2/2015    1   2015    Cairo   Egypt
10/13/2015  10  2015    Tinsukia    India

Solution

  • Bing's suggestion, capital_cities <- unique(world.cities$capital), doesn't create a list of capital cities (surprise, AI led you astray!). It creates a vector of integers of length 4 (c(0, 1, 3, 2)) holding the unique codes in that column; none of them is a city name.

    You are getting all non-capital values because city never takes the values 0, 1, 2, or 3, so every row falls through to the "else" branch of ifelse, which is "non-capital".
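
    A quick way to see this is to inspect the column before using it. The sketch below uses a small hand-made stand-in for world.cities (the object cities_demo and its rows are illustrative assumptions, not the package data), so it runs without any packages:

    ```r
    # Hypothetical stand-in for maps::world.cities; only the columns used here.
    cities_demo <- data.frame(
      name        = c("Moscow", "Lyon", "Beijing", "Shanghai", "Nanjing"),
      country.etc = c("Russia", "France", "China", "China", "China"),
      capital     = c(1, 0, 1, 2, 3)
    )

    # The 'capital' column holds numeric codes, not city names.
    codes <- unique(cities_demo$capital)
    str(codes)

    # City names never match those codes, so every row gets the "else" label.
    event_city <- c("Moscow", "Lyon", "Beijing")
    matches <- event_city %in% codes
    labels <- ifelse(matches, "capital", "non-capital")
    print(labels)
    ```

    Running str() or head() on the real world.cities$capital in the same way should show the codes immediately.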

    If you match on the city name alone, build the lookup like this:

    capitals <- unique(world.cities[world.cities$capital > 0, "name"])
    

    Then you can use an ifelse statement to create the new variable:

    df <- data.frame(country = c("China", "China", "Serbia", "Serbia", "Germany", "Germany"),
                     city = c("Beibei", "Beijing", "Bavaniste", "Belgrade", "Bayreuth" ,"Berlin"))
    
    capitals <- unique(world.cities[world.cities$capital > 0, "name"])
    
    df["capital"] <- ifelse(df$city %in% capitals, 
                            "capital", 
                            "not capital")
    

    However, this may cause a problem if two cities in different countries share a name and only one of them is a capital. Paris, France and Paris, Indiana, USA are very different places. A "safer" approach may be to use merge on both the city and the country:

    capitals <- unique(world.cities[world.cities$capital > 0, c("name", "country.etc", "capital")])
    
    capdat <- merge(df, capitals,
      by.x = c("country", "city"),
      by.y = c("country.etc", "name"),
      all.x = TRUE)
    
    capdat$capital <- ifelse(!is.na(capdat$capital), "capital", "not capital")
    

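    With these data, both approaches give the same classification. The first prints as follows (merge sorts its result by country and city, so its row order differs):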

      country      city     capital
    1   China    Beibei not capital
    2   China   Beijing     capital
    3  Serbia Bavaniste not capital
    4  Serbia  Belgrade     capital
    5 Germany  Bayreuth not capital
    6 Germany    Berlin     capital
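
    If the rows need to stay in their original order, one alternative to merge is to paste city and country into a single key and test membership with %in%. This is only a sketch: capitals_demo and df below are small hand-made stand-ins (not the real world.cities rows), and the "UK" spelling in the lookup is an assumption used to illustrate what happens when country names are spelled differently in the two tables.

    ```r
    # Hypothetical stand-in for the capital rows of maps::world.cities.
    capitals_demo <- data.frame(
      name        = c("Beijing", "Belgrade", "Paris", "Washington", "London"),
      country.etc = c("China", "Serbia", "France", "USA", "UK")
    )

    df <- data.frame(
      country = c("China", "China", "Serbia", "USA", "United Kingdom"),
      city    = c("Beibei", "Beijing", "Belgrade", "Paris", "London")
    )

    # One key per row from city and country; df keeps its original row order.
    key_df  <- paste(df$city, df$country, sep = "|")
    key_cap <- paste(capitals_demo$name, capitals_demo$country.etc, sep = "|")
    df$capital <- ifelse(key_df %in% key_cap, "capital", "not capital")

    # Countries with no counterpart in the lookup. Their rows can never match
    # and are silently labelled "not capital" (London in this example).
    unmatched_countries <- setdiff(unique(df$country),
                                   unique(capitals_demo$country.etc))
    print(df)
    print(unmatched_countries)
    ```

    A check like the setdiff call is worth running before either approach, since a spelling mismatch in the country column looks exactly like a non-capital in the result.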
    

    Note that the world.cities dataset also flags additional administrative capitals in China as 2 (municipal capital) or 3 (provincial capital); see ?world.cities. If you don't want to include those, change the lookup to unique(world.cities[world.cities$capital == 1, "name"]).
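
    Since the stated goal is a country-year dataset, here is a minimal sketch of that last step in base R. It assumes the capital_city column has already been filled in; the rows and the helper columns n_events and n_capital are made up for illustration.

    ```r
    # Hypothetical rows in the shape of the question's data.
    dt_protest <- data.frame(
      year    = c(2006, 2006, 2007, 2011, 2011),
      country = c("Russia", "Russia", "Russia", "Germany", "Germany"),
      city    = c("Moscow", "Kazan", "Moscow", "Berlin", "Bonn"),
      capital_city = c("capital", "non-capital", "capital", "capital", "non-capital")
    )

    # One counter per event, and one that is 1 only for capital-city events.
    dt_protest$n_events  <- 1L
    dt_protest$n_capital <- as.integer(dt_protest$capital_city == "capital")

    # Sum both counters within each country-year cell.
    country_year <- aggregate(cbind(n_events, n_capital) ~ country + year,
                              data = dt_protest, FUN = sum)

    # Share of a country-year's events that took place in the capital.
    country_year$share_capital <- country_year$n_capital / country_year$n_events
    print(country_year)
    ```

    Whether to keep counts, a share, or a simple any-capital-event flag depends on what the country-year analysis needs.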