r algorithm performance geolocation distance

How to calculate distance between 2 coordinates below a certain threshold in R?

I have 44,000 US Zip codes and it's corresponding centroid lat/long in R. This is from the package 'zipcode' in R. I need to calculate the distance between each zipcode and keep those distances that are less than 5 miles. The problem is to calculate all distances between the zipcodes I have to create a vector of size 44,000x44,0000 which I can't due to space issues.

I checked through the posts in R, the closest to my requirement is one that spits out the minimum distance between 2 datasets with lat/long

DB1 <- data.frame(location_id=1:7000,LATITUDE=runif(7000,min = -90,max = 90),LONGITUDE=runif(7000,min = -180,max = 180))
DB2 <- data.frame(location_id=7001:12000,LATITUDE=runif(5000,min = -90,max = 90),LONGITUDE=runif(5000,min = -180,max = 180))

DistFun <- function(ID){
  TMP <- DB1[DB1$location_id==ID,]
  TMP1 <- distGeo(TMP[,3:2],DB2[,3:2])
  TMP2 <- data.frame(DB1ID=ID,DB2ID=DB2[which.min(TMP1),1],DistanceBetween=min(TMP1)      ) 
  print(ID)
  return(TMP2)
}

DistanceMatrix <- rbind_all(lapply(DB1$location_id, DistFun))

Even if we can modify the above code to incorporate all distances <= 5 miles (for eg), it is extremely slow in execution.

Is there an efficient way to arrive at all zip code combinations that are <=5 miles from each others centroids?

Solution

Generating the whole distance matrix at a time will be very RAM consuming, looping over each combination of unique zipcodes - very time consuming. Lets find some compromise.

I suggest chunking the zipcode data.frame into pieces of (for example) 100 rows (with the help of chunk function from package bit), then calculating distances between 44336 and 100 points, filtering according to the target distance treshold and then moving on to the next data chunk. In my example I convert zipcode data into data.table to gain some speed and save RAM.

library(zipcode)
library(data.table)
library(magrittr)
library(geosphere)

data(zipcode)

setDT(zipcode)
zipcode[, dum := NA] # we'll need it for full outer join

Just for information - that's the approximate size of each piece of data in RAM.

merge(zipcode, zipcode[1:100], by = "dum", allow.cartesian = T) %>% 
  object.size() %>% print(unit = "Mb")
# 358.2 Mb

The code itself.

lapply(bit::chunk(1, nrow(zipcode), 1e2), function(ridx) {
  merge(zipcode, zipcode[ridx[1]:ridx[2]], by = "dum", allow.cartesian = T)[
    , dist := distGeo(matrix(c(longitude.x, latitude.x), ncol = 2), 
                      matrix(c(longitude.y, latitude.y), ncol = 2))/1609.34 # meters to miles
    ][dist <= 5 # necessary distance treshold
      ][, dum := NULL]
  }) %>% rbindlist -> zip_nearby_dt

zip_nearby_dt # not the whole! for first 10 chunks only

       zip.x          city.x state.x latitude.x longitude.x zip.y     city.y state.y latitude.y longitude.y     dist
    1: 00210      Portsmouth      NH   43.00590   -71.01320 00210 Portsmouth      NH   43.00590   -71.01320 0.000000
    2: 00210      Portsmouth      NH   43.00590   -71.01320 00211 Portsmouth      NH   43.00590   -71.01320 0.000000
    3: 00210      Portsmouth      NH   43.00590   -71.01320 00212 Portsmouth      NH   43.00590   -71.01320 0.000000
    4: 00210      Portsmouth      NH   43.00590   -71.01320 00213 Portsmouth      NH   43.00590   -71.01320 0.000000
    5: 00210      Portsmouth      NH   43.00590   -71.01320 00214 Portsmouth      NH   43.00590   -71.01320 0.000000
---                                                                                                              
15252: 02906      Providence      RI   41.83635   -71.39427 02771    Seekonk      MA   41.84345   -71.32343 3.688747
15253: 02912      Providence      RI   41.82674   -71.39770 02771    Seekonk      MA   41.84345   -71.32343 4.003095
15254: 02914 East Providence      RI   41.81240   -71.36834 02771    Seekonk      MA   41.84345   -71.32343 3.156966
15255: 02916         Rumford      RI   41.84325   -71.35391 02769   Rehoboth      MA   41.83507   -71.26115 4.820599
15256: 02916         Rumford      RI   41.84325   -71.35391 02771    Seekonk      MA   41.84345   -71.32343 1.573050

On my machine it took 1.7 minutes to process 10 chunks, so the whole processing may take 70-80 minutes, not fast, but may be satisfying. We can increase the chunk size to 200 or 300 rows depending on available RAM volume, this will shorten the processing time 2 or 3 times respectively.

The drawback of this solution is that the resulting data.table contains "duplicated" rows - I mean there are both distances from point A to point B, and from B to A. This may need some additional filtering.