Search code examples
rdistancelarge-datageosphere

Number of points within a radius with large datasets- R


I have county parcel level shapefiles and I'm aiming to calculate the number of parcels within a mile (about 1610 meters) in total, as well as with the same owner. I've worked through a solution, and below is my sample code, but it is fairly inefficient and memory intensive. I cannot publicly post the data, but here is the problem with some made up code:

library(rgdal)
library(rgeos)
library(geosphere)


nobs<-1000  # number of observations
nowners<-50 # number of different owners
crs<-"+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"
long<-runif(nobs,min=-94.70073, max=-94.24141) #roughly adair county in iowa
lat<-runif(nobs,min=41.15712,max=41.50415) #roughly adair county in iowa
coords<-cbind(long,lat)
owner<-sample(1:nowners,nobs, replace=T) # give id's to owners 
df<-as.data.frame(owner)
centroids<-SpatialPointsDataFrame(coords,df,proj4string = CRS(crs)) # make spatial dataframe 

d<-distm(centroids) # distance from centroids to other centroid

numdif<-matrix(0,length(owner)) #vectors of 0s to be replaced in loop
numtot<-matrix(0,length(owner))
for (i in 1:length(owner)) {
  same_id<-df$owner[i]==owner ## identify locations with same owner ID 
  numdif[i]<-as.numeric(sum(d[i,]<1609.34 & same_id==F)) #different parcel owners
  numtot[i]<-as.numeric(sum(d[i,]<1609.34)) #total parcels
}

The resulting "numdif" and "numtot" vectors give me what I want: A vector of number of neighboring parcels with different owners and total, respectively. However, this process is incredibly time consuming and memory intensive for my counties that have a far larger "nobs." Some counties have 50-75,000 observations (so the resulting matrix m has billions of elements and will likely require far more memory than I have). Does anyone have thoughts about a better way of approaching this problem, from both a speed and memory perspective? Help is greatly appreciated.


Solution

  • You could have done the counts in an apply

    d <- d < 1609.34
    nt <- apply(d, 1, sum)
    nd <- apply(d, 1, function(i) length(unique(owner[i]))) - 1
    

    I think that your computation of numdif is not correct, as it includes other owners multiple times if they have multiple parcels.

    Given the large number of observations, I would consider this route:

    d <- lapply(1:nrow(coords), function(i) which(distGeo(coords[i, ,drop=FALSE], coords) < 1609.34))
    ntot <- sapply(d, length)
    ndif <- sapply(d, function(i) length(unique(owner[i]))) - 1
    

    That is slower, but it won't create a crazy large matrix

    I should also add that your approach assumes that the parcels are small relatively to the distance considered, such that using the centroid is OK. If this is not the case, it is possible to do the computations on the polygons with rgeos::gWithinDistance, at increased computational cost.