I have a dataset with more than 20.000 observations that basically looks like this one:
df <- data.frame(
user = c("ABC", "DEF", "GHI"),
location = c("Chicago, the windy city", "Oxford University", "Paris")
)
I want to add three additional columns city
, long
, lat
and fill these columns with the city name, and the geolocations (longitude and latitude).
Therefore I thought to use the maps
package and its world.cities
database:
library(maps)
data(world.cities)
Adding the city names and geolocation would not be difficult, if the city names in location
would be displayed in the right way. However, most of them do have additional strings (e.g. "Chicago, the windy city").
How can I extract just the city names based on the world.cities
database and write the real city name to the column city
and the geolocations to long
and lat
?
As per mentioned by @Heroka in the comments, if the name of the city is always the first string in location
, you could extract the first string using stringi
, left_join
the world.cities data, and filter for the largest population in the matches.
library(stringi)
library(dplyr)
df %>%
mutate(city = stri_extract_first_words(location)) %>%
left_join(world.cities, by = c("city" = "name")) %>%
group_by(city) %>%
filter(row_number(desc(pop)) == 1)
Which gives:
#Source: local data frame [3 x 8]
#Groups: city [3]
#
# user location city country.etc pop lat long capital
# (fctr) (fctr) (chr) (chr) (int) (dbl) (dbl) (int)
#1 ABC Chicago, the windy city Chicago USA 2830144 41.84 -87.68 0
#2 DEF Oxford University Oxford UK 157568 51.76 -1.26 0
#3 GHI Paris Paris France 2141839 48.86 2.34 1
Update
If the name of the city is not always the first string in location
, you could first try to match the words in location
with a dictionary (here the name
column in world.cities) and then use the matches that return TRUE
as your location name. Here's a quick implementation (I added the "University College London" case to you data.frame)
> df
# user location
#1 ABC Chicago, the windy city
#2 DEF Oxford University
#3 GHI Paris
#4 JKL University College London
For each row, we extract all the words in location
and store them in a list lst
, loop over it to find the position of the matching name
in world.cities and store it in p
, and finally extract the element corresponding to position p
in lst
and store it in city
df %>%
mutate(lst = stri_extract_all_words(location),
p = sapply(lst, function (x) which(x %in% world.cities$name), simplify=TRUE)) %>%
mutate(city = sapply(1:length(lst), function(x) .$lst[[x]][.$p[x]])) %>%
left_join(world.cities, by = c("city" = "name")) %>%
group_by(city) %>%
filter(row_number(desc(pop)) == 1)
You can also remove the temporary columns p
and lst
by adding ... %>% select(-lst, -p)
Update 2
This should not break on malformed words but won't work for the "New York" case:
df %>%
mutate(
city = lapply(stri_extract_all_words(location),
function (x) { world.cities$name[match(x, world.cities$name)] })) %>%
tidyr::unnest(city) %>%
filter(!is.na(city)) %>%
left_join(world.cities, by = c("city" = "name")) %>%
group_by(city) %>%
filter(row_number(desc(pop)) == 1)
Update 3
This should work in all the cases:
> df
# user location
#1 ABC Chicago, the windy city
#2 DEF Oxford University
#3 GHI Paris
#4 JKL New York
#5 MNO m0ntr3al
#6 PQR University College London
df$l <- gsub("[^[:alnum:]]+", " ", df$location)
lst <- lapply(world.cities$name, function (x) { grep(x, df$l, value = TRUE) })
m <- data.table::melt(lst)
df %>%
left_join(m, by = c("l" = "value")) %>%
left_join(world.cities %>%
add_rownames %>%
mutate(rowname = as.numeric(rowname)),
by = c("L1" = "rowname")) %>%
tidyr::replace_na(list(pop = 0)) %>%
group_by(location) %>%
filter(row_number(desc(pop)) == 1) %>%
select(-(l:L1))
Which gives:
#Source: local data frame [6 x 8]
#Groups: location [6]
#
# user location name country.etc pop lat long capital
# (fctr) (fctr) (chr) (chr) (dbl) (dbl) (dbl) (int)
#1 ABC Chicago, the windy city Chicago USA 2830144 41.84 -87.68 0
#2 DEF Oxford University Oxford UK 157568 51.76 -1.26 0
#3 GHI Paris Paris France 2141839 48.86 2.34 1
#4 JKL New York New York USA 8124427 40.67 -73.94 0
#5 MNO m0ntr3al NA NA 0 NA NA NA
#6 PQR Univeristy College London London UK 7489022 51.52 -0.10 1