r, geocoding, data-cleaning, shortest-path, geosphere

Geocoding: Efficient way to find the distance between two sets of locations


I have one set of coordinates for the residences of individual voters and another set of coordinates for ballot drop-off boxes. I'm trying to find the distance between each residence and its nearest drop box. The code I have so far is below; it was adapted from another Stack Overflow example. However, it is inefficient: the dataset I'm working with has millions of rows, and the code builds every possible voter/drop-box combination before picking the smallest distance. Is there a more efficient way to do this?

What I currently have:

# Made-Up Data
library(geosphere)
library(tidyverse)
geo_voters <- data.frame(voter_id = c(12345, 45678, 89011),
                         long = c(-43.17536, -43.17411, -43.36605),
                         lat = c(-22.95414, -22.9302, -23.00133))

geo_dropoff_boxes <- data.frame(long = c(-43.19155, -43.33636, -67.45666),
                                lat = c(-22.90353, -22.87253, -26.78901))
# Code to find the distance between voters, and the dropoff boxes
# Order into a newdf as needed first.
# First, the voters:  
voter_addresses <- data.frame(voter_id = as.character(geo_voters$voter_id),
                              lon_address = geo_voters$long,
                              lat_address = geo_voters$lat
                              )
# Second, the polling locations: 
polling_address <- data.frame(place_number = 1:nrow(geo_dropoff_boxes),
                       lon_place = geo_dropoff_boxes$long,
                       lat_place = geo_dropoff_boxes$lat
                       )

# Create nested dfs: 
voter_nest <- nest(voter_addresses, -voter_id, .key = 'voter_coords')
polling_nest <- nest(polling_address, -place_number, .key = 'polling_coords')

# Combine for combinations: 
data_master <- crossing(voter_nest, polling_nest)

# Calculate shortest distance: 
shortest_dist <- data_master %>% 
  mutate(dist = map2_dbl(voter_coords, polling_coords, distm)) %>% 
  group_by(voter_id) %>% 
  filter(dist == min(dist)) %>%
  mutate(dist_km = dist/1000,
         voter_id = as.character(voter_id)) %>%
  select(voter_id, dist_km)

Solution

  • The sf package makes this simple. The st_as_sf() function converts a data frame of latitude/longitude values to georeferenced points, and st_distance() calculates the distances between them. When calling st_as_sf(), you need to specify a coordinate reference system. Your data appear to be plain latitude and longitude, so I specify crs="epsg:4326" (WGS 84), the most common latitude/longitude reference system.

    library( sf )
    
    geo_voters <- data.frame(voter_id = c(12345, 45678, 89011),
                             long = c(-43.17536, -43.17411, -43.36605),
                             lat = c(-22.95414, -22.9302, -23.00133))
    
    geo_dropoff_boxes <- data.frame(long=c(-43.19155, -43.33636, -67.45666),
                          lat=c(-22.90353, -22.87253, -26.78901))
    
    # convert the data to sf features
    geo_voters = st_as_sf( geo_voters, coords=c('long', 'lat'), crs="epsg:4326" )
    geo_dropoff_boxes = st_as_sf( geo_dropoff_boxes, coords=c('long', 'lat'), crs="epsg:4326" )
    
    # calculate the distances between voters and drop boxes
    dist = st_distance( geo_voters, geo_dropoff_boxes )
    print(dist)
    

    Now each row represents a voter and each column represents their distance to a drop box (in meters):

    Units: [m]
              [,1]     [,2]    [,3]
    [1,]  5866.745 18821.87 2482400
    [2,]  3461.945 17813.57 2483210
    [3,] 20916.618 14641.09 2462186
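
The full distance matrix has one cell per voter/drop-box pair, so with millions of voters it may not fit in memory, and the question only needs the nearest box. One possible way around this, sketched below, is sf's st_nearest_feature(), which returns the index of the closest drop box for each voter, followed by st_distance() with by_element = TRUE to measure only those pairs. The names voters_sf, boxes_sf, nearest_idx, nearest_dist, and shortest_dist are illustrative and not from the original answer. The sketch also assumes a recent sf version with s2 enabled; otherwise st_nearest_feature() may treat longitude/latitude as planar coordinates.

```r
library(sf)

# Same made-up data as in the question
geo_voters <- data.frame(voter_id = c(12345, 45678, 89011),
                         long = c(-43.17536, -43.17411, -43.36605),
                         lat = c(-22.95414, -22.9302, -23.00133))
geo_dropoff_boxes <- data.frame(long = c(-43.19155, -43.33636, -67.45666),
                                lat = c(-22.90353, -22.87253, -26.78901))

# Convert both data frames to sf point features (WGS 84)
voters_sf <- st_as_sf(geo_voters, coords = c("long", "lat"), crs = "epsg:4326")
boxes_sf <- st_as_sf(geo_dropoff_boxes, coords = c("long", "lat"), crs = "epsg:4326")

# For each voter, the row index of the nearest drop box
nearest_idx <- st_nearest_feature(voters_sf, boxes_sf)

# Distance from each voter to that one box only (element-wise, no full matrix)
nearest_dist <- st_distance(voters_sf, boxes_sf[nearest_idx, ], by_element = TRUE)

# One row per voter: nearest box and distance in kilometres
shortest_dist <- data.frame(
  voter_id = as.character(geo_voters$voter_id),
  nearest_box = nearest_idx,
  dist_km = as.numeric(nearest_dist) / 1000
)
print(shortest_dist)
```

With the sample data, nearest_box should agree with the row minima of the matrix printed above: drop boxes 1, 1, and 2 for the three voters.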