Optimizing distHaversine distance calculations for a large data frame in R


I'm working with a large dataset and I'm trying to run geospatial analysis on a local machine with 8GB of RAM. It looks like I have exceeded the resources of my machine and I'm wondering whether I can optimize my model so I can run it on my machine.

area <- data.frame(area = c('Baker Street','Bank'),
                  lat = c(51.522236,51.5134047),
                  lng = c(-0.157080, -0.08905843),
                  radius = c(100,2000)
)

stop <- data.frame(station = c('Angel','Barbican','Barons Court','Bayswater'),
                   lat = c(51.53253,51.520865,51.490281,51.51224),
                   lng = c(-0.10579,-0.097758,-0.214340,-0.187569),
                   postcode = c('EC1V','EC1A', 'W14', 'W2'))



library(geosphere)


datNew = lapply(1:nrow(area), function(i) {

  df = stop

  df$dist = distHaversine(df[,c("lng", "lat")], 
                          area[rep(i,nrow(df)), c('lng','lat')])

  df$in_circle = ifelse(df$dist <= area[i, "radius"], "Yes", "No")

  df$circle_id = area[i, "area"]

  df

})

datNew = do.call(rbind, datNew)

library(dplyr)
datNew  <- datNew %>% 
  group_by(station) %>% 
  slice(which.min(dist))

Is it possible to calculate the distances and find the minimum distance station by station, so that I don't end up with a table whose size is the number of stations multiplied by the number of areas? Or is there another solution that would let me run this in a less resource-consuming way, or split the job so it fits into RAM?


Solution

  • Have you tried calling gc() at the end of the function passed to lapply? It triggers garbage collection, which may free memory before the next iteration, although R also collects garbage automatically, so the effect may be limited. If this does not help, I'll try to come back to this answer tomorrow; just please reply :)

    EDIT:

    I don't know if you had this in mind, but here you go:

    library(geosphere)  # distHaversine()
    library(plyr)       # ddply()
    
    area <- data.frame(area = c('Baker Street','Bank'),
                       lat = c(51.522236,51.5134047),
                       lng = c(-0.157080, -0.08905843),
                       radius = c(100,2000)
    )
    
    stop <- data.frame(station = c('Angel','Barbican','Barons Court','Bayswater'),
                       lat = c(51.53253,51.520865,51.490281,51.51224),
                       lng = c(-0.10579,-0.097758,-0.214340,-0.187569),
                       postcode = c('EC1V','EC1A', 'W14', 'W2'))
    
    ## The function below takes each area in turn and keeps the station that is at the
    ## minimal distance from that area
    
    min.dist <- ddply(area, ~area, function(xframe){
    
      cat("Calculating minimum distance from area...", as.character(xframe$area), "\n")
    
      ## distHaversine expects coordinates in (longitude, latitude) order
      dists <- distHaversine(xframe[, c("lng", "lat")], stop[ , c("lng", "lat")]) 
      stop.min <- stop[which(dists == min(dists)), ]
      stop.min$area <- xframe$area
      gc()  # called before returning; a call placed after return() never runs
      stop.min
    
    })
    
    min.dist # the new data frame: the nearest station for each area
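
For the station-by-station minimum asked about in the question, a running minimum avoids building the stations × areas table altogether. The following is a minimal sketch under stated assumptions, not a solution tested on the real data: it loops over the areas and keeps, for every station, only the smallest distance seen so far, so peak memory is a few vectors of length nrow(stations). The helper `haversine_m` and the data frame name `stations` (used instead of `stop`, to avoid sharing a name with base R's stop()) are hypothetical names introduced for this example. `haversine_m` is a base R stand-in that uses the same default earth radius as geosphere::distHaversine (6378137 m), which could be swapped in.

```r
# Hypothetical helper (not part of geosphere): haversine distance in metres,
# vectorised over the first pair of coordinate vectors.
haversine_m <- function(lng1, lat1, lng2, lat2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlng <- (lng2 - lng1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlng / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

area <- data.frame(area = c('Baker Street', 'Bank'),
                   lat = c(51.522236, 51.5134047),
                   lng = c(-0.157080, -0.08905843),
                   radius = c(100, 2000))

# Same data as `stop` in the question, under a different (assumed) name.
stations <- data.frame(station = c('Angel', 'Barbican', 'Barons Court', 'Bayswater'),
                       lat = c(51.53253, 51.520865, 51.490281, 51.51224),
                       lng = c(-0.10579, -0.097758, -0.214340, -0.187569),
                       postcode = c('EC1V', 'EC1A', 'W14', 'W2'))

# Running minimum: one slot per station, updated area by area.
best_dist <- rep(Inf, nrow(stations))
best_area <- rep(NA_integer_, nrow(stations))

for (i in seq_len(nrow(area))) {
  # Distances from every station to area i (one vector, reused each pass).
  d <- haversine_m(stations$lng, stations$lat, area$lng[i], area$lat[i])
  closer <- d < best_dist
  best_dist[closer] <- d[closer]
  best_area[closer] <- i
}

# One row per station: nearest area, distance to it, and the radius check.
result <- data.frame(stations,
                     dist = best_dist,
                     circle_id = area$area[best_area],
                     in_circle = ifelse(best_dist <= area$radius[best_area],
                                        "Yes", "No"))
result
```

If the station table itself is too large to process at once, the same loop can be run on blocks of rows and the results bound together, because each station's result is independent of the others.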