I am trying to find the ID of the closest LAT_LON in a data.frame with reference to my original data.frame. I have already figured this out by merging both data.frames on a unique identifier and the calculating the distance based on the distHaverSine
function from geosphere
. Now, I want to take step further and join the data.frames without the unique identifier and find ID the nearest LAT-LON.
I have used the following code after merging:
v3 <-v2 %>% mutate(CTD = distHaversine(cbind(LON.x, LAT.x), cbind(LON.y, LAT.y)))
DATA:
loc <- data.frame(station = c('Baker Street','Bank'),
lat = c(51.522236,51.5134047),
lng = c(-0.157080, -0.08905843),
postcode = c('NW1','EC3V'))
stop <- data.frame(station = c('Angel','Barbican','Barons Court','Bayswater'),
lat = c(51.53253,51.520865,51.490281,51.51224),
lng = c(-0.10579,-0.097758,-0.214340,-0.187569),
postcode = c('EC1V','EC1A', 'W14', 'W2'))
As a final result I would like something like this:
df <- data.frame(loc = c('Baker Street','Bank','Baker Street','Bank','Baker Street','Bank','Baker
Street','Bank'),
stop = c('Angel','Barbican','Barons Court','Bayswater','Angel','Barbican','Barons Court','Bayswater'),
dist = c('x','x','x','x','x','x','x','x'),
lat = c(51.53253,51.520865,51.490281,51.51224,51.53253,51.520865,51.490281,51.51224),
lng = c(-0.10579,-0.097758,-0.214340,-0.187569,-0.10579,-0.097758,-0.214340,-0.187569),
postcode = c('EC1V','EC1A', 'W14', 'W2','EC1V','EC1A', 'W14', 'W2')
)
Any help is appreciated. Thanks.
As the distances between the object are small we can speed up the computation by using the euclidian distance between the coordinates. As we are not around the equator, the lng coordinates are squished a bit; we can make the comparison slightly better by scaling the lng a bit.
cor_stop <- stop[, c("lat", "lng")]
cor_stop$lng <- cor_stop$lng * sin(mean(cor_stop$lat, na.rm = TRUE)/180*pi)
cor_loc <- loc[, c("lat", "lng")]
cor_loc$lng <- cor_loc$lng * sin(mean(cor_loc$lat, na.rm = TRUE)/180*pi)
We can then calculate the closest stop for each location using the FNN
package which uses tree based search to quickly find the closest K neighbours. This should scale to big data sets (I have used this for datasets with millions of records):
library(FNN)
matches <- knnx.index(cor_stop, cor_loc, k = 1)
matches
## [,1]
## [1,] 4
## [2,] 2
We can then construct the end result:
res <- loc
res$stop_station <- stop$station[matches[,1]]
res$stop_lat <- stop$lat[matches[,1]]
res$stop_lng <- stop$lng[matches[,1]]
res$stop_postcode <- stop$postcode[matches[,1]]
And calculate the actual distance:
library(geosphere)
res$dist <- distHaversine(res[, c("lng", "lat")], res[, c("stop_lng", "stop_lat")])
res
## station lat lng postcode stop_station stop_lat stop_lng
## 1 Baker Street 51.52224 -0.15708000 NW1 Bayswater 51.51224 -0.187569
## 2 Bank 51.51340 -0.08905843 EC3V Barbican 51.52087 -0.097758
## stop_postcode dist
## 1 W2 2387.231
## 2 EC1A 1026.091
I you are unsure that the closest point in lat-long is also the closest point 'as the bird flies', you could use this method to first select the K closest points in lat-long; then calculate the distances for those points and then selecting the closest point.