Imagine you have a rather large dataset of 2,000,000 points randomly distributed over some polygonal space. A density function has to be estimated each time from a randomly chosen sample of 4,000 points, and this process has to be repeated 200 times. My code doesn't handle this problem very well. Any suggestions on how to improve it?
# coord is a SpatialPoints object
library(sp)
library(maptools)
library(maps)
You can get the polygonal object (shapefile) from the following link: https://www.dropbox.com/sh/65c3rke0gi4d8pb/LAKJWhwm-l
germG <- readShapePoly("vg250_gem.shp")
coord <- spsample(germG, 2e06, "random") # this command takes a few minutes to run
# R is the number of simulations
R <- 200
M <- matrix(NA,R, 256)
ptm <- proc.time()
for(r in 1:R) {
  ix <- sample(1:2e06, size=4000)       # draw 4,000 point indices at random
  Dg <- spDists(coord[ix])              # full pairwise distance matrix
  Dg <- as.vector(Dg[Dg!=0])            # keep the non-zero distances as a vector
  kg <- density(Dg, bw="nrd0", n=256)   # kernel density of the pairwise distances
  M[r,] <- kg$y
}
runningtime <- proc.time() - ptm
cat("total run time (sec) =", round(runningtime[3],1), "\n")
The code above gives a total run time (sec) = 964.8 on a Core i3, 2.27 GHz, 4 processors and 4 GB RAM. How can I speed up the performance of this for-loop simulation? I'll be very thankful for all your comments, critiques and suggestions.
Not really an answer, just some observations:
- The distance calculated by spDists(...) is Euclidean by default. If that is what you want, you are much better off using the dist(...) function - it is much more efficient (see code below). If you want geographic distance (e.g., Great Circle), you have to use spDists(..., longlat=T).
- spDists(...) calculates the full distance matrix, not just the lower triangle. This means that all non-zero distances show up twice, which has an effect on your densities. This is why M1 and M2 in the code below are not equal; a workaround is sketched after the timing comparison.
- Profiling the code (with Rprof) shows that about 46% of the time is spent in density(...), and another 38% is spent in spDists(...). So this is one case where replacing the for loop with calls to apply, lapply etc. will not help much.
- I also looked at earth.dist(...) in the fossil package, distm(...) from the geosphere package, and rdist.earth(...) in the fields package, but none of these improved run times.

Code:
library(sp)
library(maptools)
germG <- readShapePoly("vg250_gem.shp")
coord <- spsample(germG, 1e4, "random") # Just 10,000 points...
R <- 200
# dist(...) and spDists(..., longlat=F) give same result
zz <- coord[sample(1e4,size=200)]
d1 <- spDists(zz)
d2 <- dist(zz@coords)
max(abs(as.matrix(d1)-as.matrix(d2)))
# [1] 0
# but dist(...) is much faster
M1 <- matrix(NA,R, 256)
set.seed(1)
system.time({
for(r in 1:R) {
ix <- sample(1e4,size=200) # S = 200; test case
Dg <- spDists(coord[ix]) # using spDists(...)
Dg <- as.vector(Dg[Dg!=0])
kg <- density(Dg,bw="nrd0",n=256)
M1[r,] <- kg$y
}
})
# user system elapsed
# 11.08 0.17 11.28
M2 <- matrix(NA,R, 256)
set.seed(1)
system.time({
for(r in 1:R) {
ix <- sample(1e4,size=200) # S = 200; test case
Dg <- dist(coord[ix]@coords) # using dist(...)
Dg <- as.vector(Dg[Dg!=0])
kg <- density(Dg,bw="nrd0",n=256)
M2[r,] <- kg$y
}
})
# user system elapsed
# 2.67 0.03 2.73
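If spDists(...) must be kept (for example, to get great-circle distances with longlat=T), the double-counting noted above can still be avoided by keeping only the lower triangle of the symmetric matrix it returns. A minimal sketch, reusing coord and the size=200 test case (Dfull is just a name introduced here):
ix    <- sample(1e4, size=200)
Dfull <- spDists(coord[ix])          # full, symmetric n x n distance matrix
Dg    <- Dfull[lower.tri(Dfull)]     # one copy of each pairwise distance, no diagonal
kg    <- density(Dg, bw="nrd0", n=256)
This makes the distances entering density(...) comparable to the dist(...) version, although it does not remove the cost of computing the full matrix in the first place.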
Edit: In response to the OP's request, below is the profiling code with size=200.
R=200
M <- matrix(NA,R, 256)
Rprof("profile")
set.seed(1)
system.time({
for(r in 1:R) {
ix <- sample(1e4,size=200) # S = 200; test case
Dg <- spDists(coord[ix]) # using spDists(...)
Dg <- as.vector(Dg[Dg!=0])
kg <- density(Dg,bw="nrd0",n=256)
M[r,] <- kg$y
}
})
Rprof(NULL)
head(summaryRprof("profile")$by.total,10)
# total.time total.pct self.time self.pct
# "system.time" 11.52 100.00 0.02 0.17
# "spDists" 7.08 61.46 0.02 0.17
# "matrix" 6.76 58.68 0.24 2.08
# "apply" 6.58 57.12 0.26 2.26
# "FUN" 5.88 51.04 0.22 1.91
# "spDistsN1" 5.66 49.13 3.36 29.17
# "density" 3.18 27.60 0.02 0.17
# "density.default" 3.16 27.43 0.06 0.52
# "bw.nrd0" 1.98 17.19 0.00 0.00
# "quantile" 1.76 15.28 0.02 0.17
As S gets bigger, calculating the density begins to dominate, because the results must be sorted. You can run this code with ix <- sample(1e4, size=4000)
to see this.
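Not benchmarked above, but since each replication is independent and the time goes into density(...) and the distance computation rather than loop overhead, spreading the replications over several cores with the parallel package (shipped with base R) may help on a multi-core machine. A minimal sketch, assuming coord is already in memory and reusing the size=200 test case (cl, res and M3 are names introduced here):
library(parallel)
library(sp)
# Sketch: distribute the R = 200 independent replications across worker processes.
cl <- makeCluster(detectCores() - 1)
clusterEvalQ(cl, library(sp))        # load sp on each worker
clusterExport(cl, "coord")           # copy the SpatialPoints object to the workers
clusterSetRNGStream(cl, 1)           # reproducible, independent RNG streams
res <- parLapply(cl, 1:200, function(r) {
  ix <- sample(1e4, size=200)
  Dg <- as.vector(dist(coord[ix]@coords))   # dist(), the faster option shown above
  density(Dg, bw="nrd0", n=256)$y
})
M3 <- do.call(rbind, res)            # 200 x 256 matrix, as before
stopCluster(cl)
Whether this pays off depends on how long it takes to export coord to the workers; with the full 2e6-point object that copy is not free, so it is worth timing on the real data.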