I have a dataset that contains information on events that have taken place around the world. My intention is to aggregate this data to the country-year level. But before doing that, I want to create a variable "capital.city" indicating whether an event took place in a capital city or not.
What I've done so far, after consulting Bing's AI, is this:
library(countrycode)
library(maps)
# Load the world cities dataset
data("world.cities")
# Create a list of capital cities
capital_cities <- unique(world.cities$capital)
# Create a new variable indicating whether a city is a capital or not
dt_protest$capital_city <- ifelse(dt_protest$city %in% capital_cities, "capital", "non-capital")
But this doesn't really work: I get only non-capital values. What am I doing wrong?
Here's a sample of my data:
date month year city country
4/4/2006 4 2006 Lyon France
5/23/2021 5 2021 Abeokuta Nigeria
3/19/1996 3 1996 Kuala Lumpur Malaysia
11/30/2006 11 2006 Moscow Russia
11/30/2011 11 2011 Tinsukia India
1/4/2014 1 2014 Saharsa India
11/23/2016 11 2016 Venezuela Cuba
9/27/2019 9 2019 Shanghai China
5/22/2003 5 2003 Bonn Germany
12/7/2006 12 2006 Thetford United Kingdom
9/10/2010 9 2010 New Delhi India
11/17/2020 11 2020 Helsinki Finland
1/22/2011 1 2011 Berlin Germany
3/19/1993 3 1993 Jerusalem Israel
8/2/2004 8 2004 Mumbai India
12/9/2000 12 2000 Mumbai India
8/29/2001 8 2001 Guelph Canada
4/7/2003 4 2003 Seoul South Korea
9/11/2003 9 2003 Brussels Belgium
4/5/2006 4 2006 Hong Kong China
2/1/2007 2 2007 Kathmandu Nepal
10/4/2007 10 2007 Moscow Russia
9/3/2008 9 2008 Luanda Angola
10/21/2009 10 2009 Johannesburg South Africa
2/20/2010 2 2010 Tashkent Uzbekistan
7/20/2010 7 2010 Singur India
10/24/2011 10 2011 Srinagar India
11/14/2012 11 2012 Delhi India
1/2/2015 1 2015 Cairo Egypt
10/13/2015 10 2015 Tinsukia India
Bing's AI suggestion of capital_cities <- unique(world.cities$capital) doesn't create a list of capital cities (surprise, AI led you astray!). It creates a vector of integers of length 4, c(0, 1, 3, 2), which are the unique status codes in that column, not city names.
You are getting all non-capital values because the city value will never be 0, 1, 2, or 3, so every row falls through to the "else" branch of ifelse, which is "non-capital".
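A quick way to see this for yourself is to inspect the column directly. This is only a minimal sketch, assuming the maps package is installed; "Berlin" is just an arbitrary city name picked for the check.

```r
library(maps)
data("world.cities")

# The capital column holds numeric status codes, not city names
str(world.cities$capital)
table(world.cities$capital)

# So a city name can never match one of those codes
"Berlin" %in% unique(world.cities$capital)
```

The last line is the same kind of comparison your ifelse call performs for every row, which is why everything ends up in the "else" branch.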
If you use only the city name as the indicator, you can do:
capitals <- unique(world.cities[world.cities$capital > 0, "name"])
Then you can use an ifelse statement to create the new variable:
library(maps)
data("world.cities")

# Small example data
df <- data.frame(country = c("China", "China", "Serbia", "Serbia", "Germany", "Germany"),
                 city = c("Beibei", "Beijing", "Bavaniste", "Belgrade", "Bayreuth", "Berlin"))
# Names of all cities flagged as any kind of capital
capitals <- unique(world.cities[world.cities$capital > 0, "name"])
df["capital"] <- ifelse(df$city %in% capitals,
                        "capital",
                        "not capital")
However, this can misclassify events when two cities in different countries share a name and only one of them is a capital. Paris, France and Paris, Indiana, USA are very different places. A "safer" approach may be to use merge on both the city and the country:
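# Keep the country alongside the city name so both can be matched
capitals <- unique(world.cities[world.cities$capital > 0, c("name", "country.etc", "capital")])
# Merge only the key columns of df, so a "capital" column already
# present in df cannot clash with the one coming from capitals
capdat <- merge(df[, c("country", "city")], capitals,
                by.x = c("country", "city"),
                by.y = c("country.etc", "name"),
                all.x = TRUE)
# Rows without a match get NA from the left join, i.e. not a capital
capdat$capital <- ifelse(!is.na(capdat$capital), "capital", "not capital")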
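For these data, both approaches give the same classification (merge sorts by the key columns by default, so its rows may come back in a different order):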
country city capital
1 China Beibei not capital
2 China Beijing capital
3 Serbia Bavaniste not capital
4 Serbia Belgrade capital
5 Germany Bayreuth not capital
6 Germany Berlin capital
Note that the world.cities dataset also flags additional administrative capitals in China as 2 (municipal capital) or 3 (provincial capital); see ?world.cities. If you don't want to include those, change the filter to unique(world.cities[world.cities$capital == 1, "name"]).
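If you would rather keep the original row order of your data, one alternative to merge is to match on a combined city/country key. The following is only a sketch under a few assumptions: flag_capitals is a hypothetical helper (not part of maps), it counts only national capitals (capital == 1), and the four rows are copied from the sample in the question. Country spellings in your data may not match those in world.cities, so the last line lists any country names that never appear there.

```r
library(maps)
data("world.cities")

# Hypothetical helper (not part of maps): labels each row of `events`
# according to whether its city/country pair is a national capital in `cities`.
flag_capitals <- function(events, cities = world.cities) {
  caps <- cities[cities$capital == 1, c("name", "country.etc")]
  cap_key <- paste(caps$name, caps$country.etc, sep = "|")
  event_key <- paste(events$city, events$country, sep = "|")
  ifelse(event_key %in% cap_key, "capital", "non-capital")
}

# A few rows taken from the sample in the question
dt_protest <- data.frame(
  city = c("Lyon", "Moscow", "Berlin", "Bonn"),
  country = c("France", "Russia", "Germany", "Germany")
)
dt_protest$capital_city <- flag_capitals(dt_protest)

# Country names in the events data that never occur in world.cities;
# these would need recoding before the match can work for them
setdiff(unique(dt_protest$country), unique(world.cities$country.etc))
```

Because %in% does not reorder anything, dt_protest keeps its rows in place, which is convenient if you aggregate to country-year afterwards.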