Adding city names and geolocation data to dataframe

I have a dataset with more than 20.000 observations that basically looks like this one:

df <- data.frame(
    user = c("ABC", "DEF", "GHI"),
    location = c("Chicago, the windy city", "Oxford University", "Paris")
)

I want to add three additional columns city, long, lat and fill these columns with the city name, and the geolocations (longitude and latitude).

Therefore I thought to use the maps package and its world.cities database:

library(maps)
data(world.cities)

Adding the city names and geolocation would not be difficult, if the city names in location would be displayed in the right way. However, most of them do have additional strings (e.g. "Chicago, the windy city").

How can I extract just the city names based on the world.cities database and write the real city name to the column city and the geolocations to long and lat?

Solution

As per mentioned by @Heroka in the comments, if the name of the city is always the first string in location, you could extract the first string using stringi, left_join the world.cities data, and filter for the largest population in the matches.

library(stringi)
library(dplyr)

df %>%
  mutate(city = stri_extract_first_words(location)) %>%
  left_join(world.cities, by = c("city" = "name")) %>%
  group_by(city) %>%
  filter(row_number(desc(pop)) == 1)

Which gives:

#Source: local data frame [3 x 8]
#Groups: city [3]
#
#    user                location    city country.etc     pop   lat   long capital
#  (fctr)                  (fctr)   (chr)       (chr)   (int) (dbl)  (dbl)   (int)
#1    ABC Chicago, the windy city Chicago         USA 2830144 41.84 -87.68       0
#2    DEF       Oxford University  Oxford          UK  157568 51.76  -1.26       0
#3    GHI                   Paris   Paris      France 2141839 48.86   2.34       1

Update

If the name of the city is not always the first string in location, you could first try to match the words in location with a dictionary (here the name column in world.cities) and then use the matches that return TRUE as your location name. Here's a quick implementation (I added the "University College London" case to you data.frame)

> df
#  user                  location
#1  ABC   Chicago, the windy city
#2  DEF         Oxford University
#3  GHI                     Paris
#4  JKL University College London

For each row, we extract all the words in location and store them in a list lst, loop over it to find the position of the matching name in world.cities and store it in p, and finally extract the element corresponding to position p in lst and store it in city

df %>%
  mutate(lst = stri_extract_all_words(location),
         p = sapply(lst, function (x) which(x %in% world.cities$name), simplify=TRUE)) %>%
  mutate(city = sapply(1:length(lst), function(x) .$lst[[x]][.$p[x]])) %>%
  left_join(world.cities, by = c("city" = "name")) %>%
  group_by(city) %>%
  filter(row_number(desc(pop)) == 1)

You can also remove the temporary columns p and lst by adding ... %>% select(-lst, -p)

Update 2

This should not break on malformed words but won't work for the "New York" case:

df %>%
  mutate(
    city = lapply(stri_extract_all_words(location), 
                  function (x) { world.cities$name[match(x, world.cities$name)] })) %>%
  tidyr::unnest(city) %>%
  filter(!is.na(city)) %>%
  left_join(world.cities, by = c("city" = "name")) %>%
  group_by(city) %>%
  filter(row_number(desc(pop)) == 1)

Update 3

This should work in all the cases:

> df
#  user                  location
#1  ABC   Chicago, the windy city
#2  DEF         Oxford University
#3  GHI                     Paris
#4  JKL                  New York
#5  MNO                  m0ntr3al
#6  PQR University College London

df$l <- gsub("[^[:alnum:]]+", " ", df$location)
lst  <- lapply(world.cities$name, function (x) { grep(x, df$l, value = TRUE) })
m    <- data.table::melt(lst)

df %>% 
  left_join(m, by = c("l" = "value")) %>%
  left_join(world.cities %>% 
              add_rownames %>% 
              mutate(rowname = as.numeric(rowname)), 
            by = c("L1" = "rowname")) %>% 
  tidyr::replace_na(list(pop = 0)) %>%
  group_by(location) %>%
  filter(row_number(desc(pop)) == 1) %>%
  select(-(l:L1))

Which gives:

#Source: local data frame [6 x 8]
#Groups: location [6]
#
#    user                  location     name country.etc     pop   lat   long capital
#  (fctr)                    (fctr)    (chr)       (chr)   (dbl) (dbl)  (dbl)   (int)
#1    ABC   Chicago, the windy city  Chicago         USA 2830144 41.84 -87.68       0
#2    DEF         Oxford University   Oxford          UK  157568 51.76  -1.26       0
#3    GHI                     Paris    Paris      France 2141839 48.86   2.34       1
#4    JKL                  New York New York         USA 8124427 40.67 -73.94       0
#5    MNO                  m0ntr3al       NA          NA       0    NA     NA      NA
#6    PQR Univeristy College London   London          UK 7489022 51.52  -0.10       1