I am working on the Gowalla location-based checkin dataset which has around 6.44 million checkins. Unique locations on these checkins are 1.28 million. But Gowalla only gives latitudes and longitudes. So I need to find city, state and country for each of those lats and longs. From another post on StackOverflow I was able to create the R query below which queries the open street maps and finds the relevant geographical details.
Unfortunately it takes around 1 minute to process 125 rows, which means 1.28 million rows would take a couple of days. Is there a faster way to find these details? Maybe there is some package with builtin lats and longs of cities of the world to find the city name for the given lat and long so I do not have to do online querying.
Venue table is a data frame with 3 columns: 1: vid(venueId), 2 lat(latitude), 3: long(longitude)
for(i in 1:nrow(venueTable)) {
#this is just an indicator to display current value of i on screen
cat(paste(".",i,"."))
#Below code composes the url query
url <- paste("http://nominatim.openstreetmap.org/reverse.php? format=json&lat=",
venueTableTest3$lat[i],"&lon=",venueTableTest3$long[i])
url <- gsub(' ','',url)
url <- paste(url)
x <- fromJSON(url)
venueTableTest3$display_name[i] <- x$display_name
venueTableTest3$country[i] <- x$address$country
}
I am using the jsonlite
package in R which makes x
which is the result of the JSON query as a dataframe which stores various results returned. So using x$display_name
or x$address$city
i use my required field.
My laptop is core i5 3230M with 8gb ram and 120gb SSD using Windows 8.
You're going to have issues even if you persevere with time. The service you're querying allows 'an absolute maximum of one request per second', which you're already breaching. It's likely that they will throttle your requests before you reach 1.2m queries. Their website notes similar APIs for larger uses have only around 15k free daily requests.
It'd be much better for you to use an offline option. A quick search shows that there are many freely available datasets of populated places, along with their longitude and latitudes. Here's one we'll use: http://simplemaps.com/resources/world-cities-data
> library(dplyr)
> cities.data <- read.csv("world_cities.csv") %>% tbl_df
> print(cities.data)
Source: local data frame [7,322 x 9]
city city_ascii lat lng pop country iso2 iso3 province
(fctr) (fctr) (dbl) (dbl) (dbl) (fctr) (fctr) (fctr) (fctr)
1 Qal eh-ye Now Qal eh-ye 34.9830 63.1333 2997 Afghanistan AF AFG Badghis
2 Chaghcharan Chaghcharan 34.5167 65.2500 15000 Afghanistan AF AFG Ghor
3 Lashkar Gah Lashkar Gah 31.5830 64.3600 201546 Afghanistan AF AFG Hilmand
4 Zaranj Zaranj 31.1120 61.8870 49851 Afghanistan AF AFG Nimroz
5 Tarin Kowt Tarin Kowt 32.6333 65.8667 10000 Afghanistan AF AFG Uruzgan
6 Zareh Sharan Zareh Sharan 32.8500 68.4167 13737 Afghanistan AF AFG Paktika
7 Asadabad Asadabad 34.8660 71.1500 48400 Afghanistan AF AFG Kunar
8 Taloqan Taloqan 36.7300 69.5400 64256 Afghanistan AF AFG Takhar
9 Mahmud-E Eraqi Mahmud-E Eraqi 35.0167 69.3333 7407 Afghanistan AF AFG Kapisa
10 Mehtar Lam Mehtar Lam 34.6500 70.1667 17345 Afghanistan AF AFG Laghman
.. ... ... ... ... ... ... ... ... ...
It's hard to demonstrate without any actual data examples (helpful to provide!), but we can make up some toy data.
# make up toy data
> candidate.longlat <- data.frame(vid = 1:3,
lat = c(12.53, -16.31, 42.87),
long = c(-70.03, -48.95, 74.59))
Using the distm
function in geosphere
, we can calculate the distances between all your data and all the city locations at once. For you, this will make a matrix
containing ~8,400,000,000 numbers, so it might take a while (can explore parallisation), and may be highly memory intensive.
> install.packages("geosphere")
> library(geosphere)
# compute distance matrix using geosphere
> distance.matrix <- distm(x = candidate.longlat[,c("long", "lat")],
y = cities.data[,c("lng", "lat")])
It's then easy to find the closest city to each of your data points, and cbind
it to your data.frame
.
# work out which index in the matrix is closest to the data
> closest.index <- apply(distance.matrix, 1, which.min)
# rbind city and country of match with original query
> candidate.longlat <- cbind(candidate.longlat, cities.data[closest.index, c("city", "country")])
> print(candidate.longlat)
vid lat long city country
1 1 12.53 -70.03 Oranjestad Aruba
2 2 -16.31 -48.95 Anapolis Brazil
3 3 42.87 74.59 Bishkek Kyrgyzstan