I have the two following datasets:
houses <- data.table(house_number = c(1:3),
lat_decimal = seq(1.1, 1.3, by = 0.1),
lon_decimal = seq(1.4, 1.6, by = 0.1))
stations <- data.table(station_numer = c(1:11),
lat_decimal = seq(1, 2, by = 0.1),
lon_decimal = seq(2, 3, by = 0.1))
I want to merge houses
and stations
together such that the resulting station_number
is the station that's closest to the corresponding house_number
.
This question is very similar, but I'm not sure if they're working with latitude and longitude and also, I don't know how to calculate distances when dealing with longitude and latitude (which is why I prefer to simply use distm
from the geosphere
package).
I have never worked with the outer function. In case the answer from the aforementioned question would work, how can I adapt the answer to use the distm
function instead of the sqrt
function?
Your question is a bit more complicated than a simple merge, and outer
is somewhat ill-suited for the purpose. To be as thorough as possible, we want to calculate the distance between all combinations of houses and stations, then keep only the closest station per house. We'll need two packages:
library(tidyverse)
library(geosphere)
First, a bit of prep. distm
expects coordinates to be ordered as longitude first, latitude second (you have the opposite), so let's fix that, name the columns better, and correct a typo while we're at it:
houses <- data.frame(house_number = c(1:3),
lon_house = seq(1.4, 1.6, by = 0.1),
lat_house = seq(1.1, 1.3, by = 0.1)
)
stations <- data.frame(station_number = c(1:11),
lon_station = seq(2, 3, by = 0.1),
lat_station = seq(1, 2, by = 0.1)
)
We'll create "nested" data frames so that it's easier to keep coordinates together:
house_nest <- nest(houses, -house_number, .key = 'house_coords')
station_nest <- nest(stations, -station_number, .key = 'station_coords')
house_number house_coords
<int> <list>
1 1 <data.frame [1 × 2]>
2 2 <data.frame [1 × 2]>
3 3 <data.frame [1 × 2]>
station_number station_coords
<int> <list>
1 1 <data.frame [1 × 2]>
2 2 <data.frame [1 × 2]>
3 3 <data.frame [1 × 2]>
4 4 <data.frame [1 × 2]>
5 5 <data.frame [1 × 2]>
6 6 <data.frame [1 × 2]>
7 7 <data.frame [1 × 2]>
8 8 <data.frame [1 × 2]>
9 9 <data.frame [1 × 2]>
10 10 <data.frame [1 × 2]>
11 11 <data.frame [1 × 2]>
Use dplyr::crossing
to combine every row from both data frames:
data.master <- crossing(house_nest, station_nest)
house_number house_coords station_number station_coords
<int> <list> <int> <list>
1 1 <data.frame [1 × 2]> 1 <data.frame [1 × 2]>
2 1 <data.frame [1 × 2]> 2 <data.frame [1 × 2]>
3 1 <data.frame [1 × 2]> 3 <data.frame [1 × 2]>
4 1 <data.frame [1 × 2]> 4 <data.frame [1 × 2]>
5 1 <data.frame [1 × 2]> 5 <data.frame [1 × 2]>
6 1 <data.frame [1 × 2]> 6 <data.frame [1 × 2]>
7 1 <data.frame [1 × 2]> 7 <data.frame [1 × 2]>
8 1 <data.frame [1 × 2]> 8 <data.frame [1 × 2]>
9 1 <data.frame [1 × 2]> 9 <data.frame [1 × 2]>
10 1 <data.frame [1 × 2]> 10 <data.frame [1 × 2]>
# ... with 23 more rows
With all this now in place, we can use distm
on each row to calculate a distance, and keep the shortest distance per house:
data.dist <- data.master %>%
mutate(dist = map2_dbl(house_coords, station_coords, distm)) %>%
group_by(house_number) %>%
filter(dist == min(dist))
house_number house_coords station_number station_coords dist
<int> <list> <int> <list> <dbl>
1 1 <data.frame [1 × 2]> 1 <data.frame [1 × 2]> 67690.
2 2 <data.frame [1 × 2]> 1 <data.frame [1 × 2]> 59883.
3 3 <data.frame [1 × 2]> 1 <data.frame [1 × 2]> 55519.