I'm working with a large dataset and I'm trying to run geospatial analysis on a local machine with 8GB of RAM. It looks like I have exceeded the resources of my machine and I'm wondering whether I can optimize my model so I can run it on my machine.
area <- data.frame(area = c('Baker Street','Bank'),
lat = c(51.522236,51.5134047),
lng = c(-0.157080, -0.08905843),
radius = c(100,2000)
)
stop <- data.frame(station = c('Angel','Barbican','Barons Court','Bayswater'),
lat = c(51.53253,51.520865,51.490281,51.51224),
lng = c(-0.10579,-0.097758,-0.214340,-0.187569),
postcode = c('EC1V','EC1A', 'W14', 'W2'))
library(geosphere)
datNew = lapply(1:nrow(area), function(i) {
df = stop
df$dist = distHaversine(df[,c("lng", "lat")],
area[rep(i,nrow(df)), c('lng','lat')])
df$in_circle = ifelse(df$dist <= area[i, "radius"], "Yes", "No")
df$circle_id = area[i, "area"]
df
})
datNew = do.call(rbind, datNew)
require(dplyr)
datNew <- datNew %>%
group_by(station) %>%
slice(which.min(dist))
Is it possible to calculate the distances and find the minimum distance station by station, so that I don't end up with a data frame of (number of stations) × (number of areas) rows? Or is there another solution that would let me run this in a less resource-intensive way, or split the job so it fits into RAM?
Have you tried putting gc() at the end of the lapply function? It triggers garbage collection, which may free memory for the next iteration. If this does not help, I'll try to come back to this answer tomorrow; just please reply :)
EDIT:
I don't know if this is what you had in mind, but here you go:
library(geosphere)
library("plyr")
library("magrittr")
area <- data.frame(area = c('Baker Street','Bank'),
lat = c(51.522236,51.5134047),
lng = c(-0.157080, -0.08905843),
radius = c(100,2000)
)
stop <- data.frame(station = c('Angel','Barbican','Barons Court','Bayswater'),
lat = c(51.53253,51.520865,51.490281,51.51224),
lng = c(-0.10579,-0.097758,-0.214340,-0.187569),
postcode = c('EC1V','EC1A', 'W14', 'W2'))
## The function below takes the areas one at a time and keeps the station(s) at the minimal
## distance from the given area
min.dist <- ddply(area, ~area, function(xframe){
cat("Calculating minimum distance from area...", as.character(xframe$area), "\n")
## distHaversine expects coordinates in (longitude, latitude) order
dists <- distHaversine(xframe[, c("lng", "lat")], stop[ , c("lng", "lat")])
stop.min <- stop[which(min(dists)==dists), ]
stop.min$area <- xframe$area
gc() ## must come before return(); anything after return() never runs
return(stop.min)
})
min.dist # the new data frame
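If the goal is the nearest area for each station (as in the question) rather than the nearest station for each area, one option is to keep a running minimum per station while looping over the areas. The sketch below is only an illustration of that idea, not a tested solution for the full dataset: `nearest_area` is a hypothetical helper name that is not part of geosphere or of the original post, and it assumes the same `lng`, `lat`, `radius` and `area` columns as the example data. Memory use should stay proportional to the number of stations, because only one distance vector exists at a time.

```r
library(geosphere)

area <- data.frame(area = c('Baker Street','Bank'),
                   lat = c(51.522236,51.5134047),
                   lng = c(-0.157080, -0.08905843),
                   radius = c(100,2000))
stop <- data.frame(station = c('Angel','Barbican','Barons Court','Bayswater'),
                   lat = c(51.53253,51.520865,51.490281,51.51224),
                   lng = c(-0.10579,-0.097758,-0.214340,-0.187569),
                   postcode = c('EC1V','EC1A', 'W14', 'W2'))

## Hypothetical helper (not from the original post): for every station,
## find the closest area without building a stations x areas table.
nearest_area <- function(stop, area) {
  ## distHaversine expects (longitude, latitude) order
  stop_xy   <- as.matrix(stop[, c("lng", "lat")])
  best_dist <- rep(Inf, nrow(stop))          # smallest distance seen so far
  best_idx  <- rep(NA_integer_, nrow(stop))  # row of the area giving that distance

  for (i in seq_len(nrow(area))) {
    ## one distance vector (length = number of stations) per area
    d <- distHaversine(stop_xy, c(area$lng[i], area$lat[i]))
    closer <- d < best_dist
    best_dist[closer] <- d[closer]
    best_idx[closer]  <- i
  }

  out <- stop
  out$dist      <- best_dist
  out$circle_id <- area$area[best_idx]
  out$in_circle <- ifelse(best_dist <= area$radius[best_idx], "Yes", "No")
  out
}

datNew <- nearest_area(stop, area)
datNew
```

The result has one row per station, like the output of the `group_by(station)` / `slice(which.min(dist))` step in the question, so no rbind or grouping step is needed afterwards. If the number of areas is also very large, the same loop could be run over blocks of stations and the pieces combined at the end.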