
left_join based on closest LAT_LON in R


I am trying to find the ID of the closest LAT_LON in one data.frame relative to each row of my original data.frame. I have already done this by merging both data.frames on a unique identifier and then calculating the distance with the distHaversine function from geosphere. Now I want to take this a step further: join the data.frames without the unique identifier and find the ID of the nearest LAT_LON. I used the following code after merging:

v3 <-v2 %>% mutate(CTD = distHaversine(cbind(LON.x, LAT.x), cbind(LON.y, LAT.y)))

DATA:

loc <- data.frame(station = c('Baker Street','Bank'),
                  lat = c(51.522236,51.5134047),
                  lng = c(-0.157080, -0.08905843),
                  postcode = c('NW1','EC3V'))
stop <- data.frame(station = c('Angel','Barbican','Barons Court','Bayswater'),
                lat = c(51.53253,51.520865,51.490281,51.51224),
                lng = c(-0.10579,-0.097758,-0.214340,-0.187569),
                postcode = c('EC1V','EC1A', 'W14', 'W2'))

As a final result I would like something like this:

df <- data.frame(loc = c('Baker Street','Bank','Baker Street','Bank','Baker Street','Bank','Baker Street','Bank'), 
              stop = c('Angel','Barbican','Barons Court','Bayswater','Angel','Barbican','Barons Court','Bayswater'), 
              dist = c('x','x','x','x','x','x','x','x'), 
              lat = c(51.53253,51.520865,51.490281,51.51224,51.53253,51.520865,51.490281,51.51224), 
              lng = c(-0.10579,-0.097758,-0.214340,-0.187569,-0.10579,-0.097758,-0.214340,-0.187569),
              postcode = c('EC1V','EC1A', 'W14', 'W2','EC1V','EC1A', 'W14', 'W2')
              )

Any help is appreciated. Thanks.


Solution

  • Because the distances between the objects are small, we can speed up the computation by using the Euclidean distance between the coordinates. As we are not at the equator, a degree of longitude covers less ground than a degree of latitude; we can make the comparison more accurate by scaling lng by the cosine of the latitude, using the same factor for both data sets.

    # Scale longitude by cos(latitude): a degree of longitude shrinks by this
    # factor away from the equator. One common factor is used for both sets.
    lng_scale <- cos(mean(c(stop$lat, loc$lat), na.rm = TRUE) / 180 * pi)
    cor_stop <- stop[, c("lat", "lng")]
    cor_stop$lng <- cor_stop$lng * lng_scale
    cor_loc <- loc[, c("lat", "lng")]
    cor_loc$lng <- cor_loc$lng * lng_scale
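
    For reference, the idea behind this kind of longitude scaling is the equirectangular approximation, which is reasonable over short distances. The symbols are my own notation, not part of the answer's code: \varphi is latitude and \lambda is longitude (both in radians), \varphi_m is the mean latitude of the area, and R is the Earth's radius.

    ```latex
    d \approx R \sqrt{(\Delta\varphi)^2 + \left(\cos\varphi_m \,\Delta\lambda\right)^2}
    ```

    Since R is a constant, ranking points by the Euclidean distance in (lat, lng \cdot \cos\varphi_m) gives approximately the same order as ranking them by d. Near London, \cos(51.5^\circ) \approx 0.62.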
    

    We can then find the closest stop for each location using the FNN package, which uses a tree-based search to quickly find the K nearest neighbours. This should scale to large data sets (I have used it on data sets with millions of records):

    library(FNN)
    matches <- knnx.index(cor_stop, cor_loc, k = 1)
    matches
    
    ##      [,1]
    ## [1,]    4
    ## [2,]    2
    

    We can then construct the end result:

    res <- loc
    res$stop_station  <- stop$station[matches[,1]]
    res$stop_lat      <- stop$lat[matches[,1]]
    res$stop_lng      <- stop$lng[matches[,1]]
    res$stop_postcode <- stop$postcode[matches[,1]]
    

    And calculate the actual distance (distHaversine returns metres by default):

    library(geosphere)
    res$dist <- distHaversine(res[, c("lng", "lat")], res[, c("stop_lng", "stop_lat")])
    res
    
    ##          station      lat         lng postcode stop_station stop_lat  stop_lng
    ## 1 Baker Street 51.52224 -0.15708000      NW1    Bayswater 51.51224 -0.187569
    ## 2         Bank 51.51340 -0.08905843     EC3V     Barbican 51.52087 -0.097758
    ##   stop_postcode     dist
    ## 1            W2 2387.231
    ## 2          EC1A 1026.091
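
    For a data set this small, the result can be cross-checked by brute force. The following is only a sketch in base R: the names `haversine`, `pairs` and `nearest` are made up for illustration, and the helper reimplements the haversine formula with the Earth radius that distHaversine uses by default (6378137 m). It also builds the all-pairs table with a dist column that the question describes. It computes every location/stop combination, so it will not scale the way the tree-based search does.

    ```r
    loc <- data.frame(station = c('Baker Street', 'Bank'),
                      lat = c(51.522236, 51.5134047),
                      lng = c(-0.157080, -0.08905843),
                      postcode = c('NW1', 'EC3V'))
    stop <- data.frame(station = c('Angel', 'Barbican', 'Barons Court', 'Bayswater'),
                       lat = c(51.53253, 51.520865, 51.490281, 51.51224),
                       lng = c(-0.10579, -0.097758, -0.214340, -0.187569),
                       postcode = c('EC1V', 'EC1A', 'W14', 'W2'))

    # Hypothetical helper: haversine great-circle distance in metres
    haversine <- function(lng1, lat1, lng2, lat2, r = 6378137) {
      to_rad <- pi / 180
      dlat <- (lat2 - lat1) * to_rad
      dlng <- (lng2 - lng1) * to_rad
      a <- sin(dlat / 2)^2 +
        cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlng / 2)^2
      2 * r * asin(pmin(1, sqrt(a)))
    }

    # Row indices for every location/stop combination (a cross join)
    ij <- expand.grid(i = seq_len(nrow(loc)), j = seq_len(nrow(stop)))

    pairs <- data.frame(
      loc      = loc$station[ij$i],
      stop     = stop$station[ij$j],
      dist     = haversine(loc$lng[ij$i], loc$lat[ij$i],
                           stop$lng[ij$j], stop$lat[ij$j]),
      lat      = stop$lat[ij$j],
      lng      = stop$lng[ij$j],
      postcode = stop$postcode[ij$j]
    )

    # Keep the row with the smallest distance for each location
    nearest <- do.call(rbind, lapply(split(pairs, pairs$loc),
                                     function(d) d[which.min(d$dist), ]))
    nearest
    ```

    `pairs` has one row per combination (8 here), and `nearest` should agree with the stops and distances found above.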
    

    If you are unsure that the closest point in lat-long is also the closest point 'as the crow flies', you could use this method to first select the K closest points in lat-long, then calculate the actual distances for those points, and finally select the closest one.
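
    That two-step refinement could look roughly like this. It is a sketch, not a tuned implementation: K = 3 is an arbitrary choice (it must not exceed the number of stops), and `lng_scale`, `cand` and `best` are illustrative names. It uses a cosine-of-latitude scale factor for longitude, with one common factor for both data sets.

    ```r
    library(FNN)        # knnx.index(): tree-based nearest-neighbour search
    library(geosphere)  # distHaversine(): great-circle distance in metres

    loc <- data.frame(station = c('Baker Street', 'Bank'),
                      lat = c(51.522236, 51.5134047),
                      lng = c(-0.157080, -0.08905843),
                      postcode = c('NW1', 'EC3V'))
    stop <- data.frame(station = c('Angel', 'Barbican', 'Barons Court', 'Bayswater'),
                       lat = c(51.53253, 51.520865, 51.490281, 51.51224),
                       lng = c(-0.10579, -0.097758, -0.214340, -0.187569),
                       postcode = c('EC1V', 'EC1A', 'W14', 'W2'))

    # Approximate planar coordinates: latitude plus scaled longitude
    lng_scale <- cos(mean(c(loc$lat, stop$lat)) * pi / 180)
    cor_loc  <- cbind(loc$lat,  loc$lng  * lng_scale)
    cor_stop <- cbind(stop$lat, stop$lng * lng_scale)

    # Step 1: K candidate stops per location from the fast approximate search
    K <- 3  # assumption: any small K <= nrow(stop)
    cand <- knnx.index(cor_stop, cor_loc, k = K)

    # Step 2: exact haversine distance to the candidates only; keep the best
    best <- vapply(seq_len(nrow(loc)), function(i) {
      idx <- cand[i, ]
      d <- distHaversine(c(loc$lng[i], loc$lat[i]),
                         cbind(stop$lng[idx], stop$lat[idx]))
      idx[which.min(d)]
    }, numeric(1))

    res <- loc
    res$stop_station <- stop$station[best]
    res$dist <- distHaversine(cbind(loc$lng, loc$lat),
                              cbind(stop$lng[best], stop$lat[best]))
    res
    ```

    The exact distance is only computed for K candidates per location, so the cost stays close to that of the plain nearest-neighbour search while guarding against cases where the approximate ordering is slightly off.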