Tags: r, dataframe, geolocation

Faster way to process 1.2 million JSON geolocation queries from large dataframe


I am working with the Gowalla location-based check-in dataset, which has around 6.44 million check-ins covering about 1.28 million unique locations. Gowalla only gives latitudes and longitudes, so I need to find the city, state and country for each of those lat/long pairs. From another post on Stack Overflow I was able to put together the R code below, which queries OpenStreetMap (Nominatim) and returns the relevant geographical details.

Unfortunately it takes around 1 minute to process 125 rows, which means 1.28 million rows would take roughly a week. Is there a faster way to find these details? Maybe there is a package with built-in latitudes and longitudes for the world's cities, so I could look up the city name for a given lat/long pair without querying an online service.

venueTable is a data frame with 3 columns: vid (venue id), lat (latitude) and long (longitude).

library(jsonlite)

for(i in 1:nrow(venueTable)) {
  # progress indicator: print the current value of i on screen
  cat(paste0(".", i, "."))

  # compose the reverse-geocoding query URL
  url <- paste0("http://nominatim.openstreetmap.org/reverse.php?format=json&lat=",
                venueTable$lat[i], "&lon=", venueTable$long[i])

  # fetch and parse the JSON response, then store the fields we need
  x <- fromJSON(url)
  venueTable$display_name[i] <- x$display_name
  venueTable$country[i] <- x$address$country
}

I am using the jsonlite package in R: fromJSON() parses the response into an object x holding the returned fields, so I can pull out what I need with x$display_name or x$address$city.

My laptop is a Core i5-3230M with 8 GB RAM and a 120 GB SSD, running Windows 8.


Solution

  • You're going to run into problems even if you're prepared to wait. The service you're querying allows 'an absolute maximum of one request per second', which you're already exceeding, so it's likely they will throttle your requests long before you reach 1.2 million queries. Their website notes that comparable APIs aimed at larger volumes offer only around 15,000 free requests per day.
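
    If you do stay with Nominatim for a small subset in the meantime, a minimal sketch of a throttled loop might look like the following (assuming the same venueTable columns as in the question; the Sys.sleep(1) call keeps you at or under one request per second, and tryCatch means a single failed lookup doesn't stop the run):

    library(jsonlite)

    for(i in 1:nrow(venueTable)) {
      url <- paste0("http://nominatim.openstreetmap.org/reverse.php?format=json&lat=",
                    venueTable$lat[i], "&lon=", venueTable$long[i])

      # skip rows that error out instead of aborting the whole loop
      x <- tryCatch(fromJSON(url), error = function(e) NULL)
      if (!is.null(x)) {
        venueTable$display_name[i] <- x$display_name
        venueTable$country[i] <- x$address$country
      }

      # stay at or below the documented limit of one request per second
      Sys.sleep(1)
    }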

    It'd be much better to use an offline option. A quick search shows that there are many freely available datasets of populated places, along with their latitudes and longitudes. Here's one we'll use: http://simplemaps.com/resources/world-cities-data

    > library(dplyr)
    
    > cities.data <- read.csv("world_cities.csv") %>% tbl_df
    > print(cities.data)
    
    Source: local data frame [7,322 x 9]
    
                 city     city_ascii     lat     lng    pop     country   iso2   iso3 province
               (fctr)         (fctr)   (dbl)   (dbl)  (dbl)      (fctr) (fctr) (fctr)   (fctr)
    1   Qal eh-ye Now      Qal eh-ye 34.9830 63.1333   2997 Afghanistan     AF    AFG  Badghis
    2     Chaghcharan    Chaghcharan 34.5167 65.2500  15000 Afghanistan     AF    AFG     Ghor
    3     Lashkar Gah    Lashkar Gah 31.5830 64.3600 201546 Afghanistan     AF    AFG  Hilmand
    4          Zaranj         Zaranj 31.1120 61.8870  49851 Afghanistan     AF    AFG   Nimroz
    5      Tarin Kowt     Tarin Kowt 32.6333 65.8667  10000 Afghanistan     AF    AFG  Uruzgan
    6    Zareh Sharan   Zareh Sharan 32.8500 68.4167  13737 Afghanistan     AF    AFG  Paktika
    7        Asadabad       Asadabad 34.8660 71.1500  48400 Afghanistan     AF    AFG    Kunar
    8         Taloqan        Taloqan 36.7300 69.5400  64256 Afghanistan     AF    AFG   Takhar
    9  Mahmud-E Eraqi Mahmud-E Eraqi 35.0167 69.3333   7407 Afghanistan     AF    AFG   Kapisa
    10     Mehtar Lam     Mehtar Lam 34.6500 70.1667  17345 Afghanistan     AF    AFG  Laghman
    ..            ...            ...     ...     ...    ...         ...    ...    ...      ...
    

    It's hard to demonstrate without actual example data (it would help to include some in the question!), but we can make up some toy data.

    # make up toy data
    > candidate.longlat <- data.frame(vid = 1:3, 
                                    lat = c(12.53, -16.31, 42.87), 
                                    long = c(-70.03, -48.95, 74.59))
    

    Using the distm function from geosphere, we can calculate the distances between all of your points and all of the city locations at once. For your data this means a matrix of roughly 9.4 billion entries (1.28 million rows x 7,322 cities), so it may take a while (you could explore parallelisation) and will be very memory intensive; a chunked approach, sketched at the end of this answer, keeps the memory use manageable.

    > install.packages("geosphere")
    > library(geosphere)
    
    # compute distance matrix using geosphere
    > distance.matrix <- distm(x = candidate.longlat[,c("long", "lat")], 
                             y = cities.data[,c("lng", "lat")])
    

    It's then easy to find the closest city to each of your data points, and cbind it to your data.frame.

    # for each query row, find the index of the nearest city in the matrix
    > closest.index <- apply(distance.matrix, 1, which.min)
    
    # cbind the city and country of the closest match onto the original query
    > candidate.longlat <- cbind(candidate.longlat, cities.data[closest.index, c("city", "country")])
    > print(candidate.longlat)
    
      vid    lat   long       city    country
    1   1  12.53 -70.03 Oranjestad      Aruba
    2   2 -16.31 -48.95   Anapolis     Brazil
    3   3  42.87  74.59    Bishkek Kyrgyzstan
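
    For the full 1.28 million rows, the single distm() call above would need a matrix of roughly 1.28 million x 7,322 entries (tens of gigabytes), which won't fit in 8 GB of RAM. One way around that, sketched below under the assumption that venueTable has the lat/long columns described in the question, is to process the queries in chunks and keep only the index of the nearest city for each row:

    library(geosphere)

    chunk.size <- 10000
    n <- nrow(venueTable)
    closest.index <- integer(n)

    for(start in seq(1, n, by = chunk.size)) {
      end <- min(start + chunk.size - 1, n)

      # distance matrix for this chunk only: at most chunk.size x 7,322 values
      d <- distm(x = venueTable[start:end, c("long", "lat")],
                 y = cities.data[, c("lng", "lat")])

      # for each row in the chunk, keep the index of the nearest city
      closest.index[start:end] <- apply(d, 1, which.min)
    }

    venueTable <- cbind(venueTable, cities.data[closest.index, c("city", "country")])

    Because the chunks are independent, this also parallelises naturally if you want to split the work across cores.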