I'm working with two RasterStack objects, each consisting of ten layers that represent single time steps.
# Mock data
library(raster)

pred.rst.stck <- do.call("stack", lapply(seq(10), function(i) {
  pred.rst <- raster(nrows = 15, ncols = 15, xmn = 0, xmx = 10, ymn = 0, ymx = 10)
  pred.rst[] <- rnorm(225, 50, 10)
  return(pred.rst)
}))

resp.rst.stck <- do.call("stack", lapply(seq(10), function(i) {
  resp.rst <- raster(nrows = 10, ncols = 10, xmn = 0, xmx = 10, ymn = 0, ymx = 10)
  resp.rst[] <- rnorm(100, 50, 10)
  return(resp.rst)
}))
pred.rst.stck serves as the set of predictor variables and resp.rst.stck as the set of response variables. For each single cell of the predictor RasterStack, I'd like to fit a linear model against every single cell of the response RasterStack, extract the corresponding R-squared from each fitted model, and sum those up. To cut a long story short, here's my fastest approach so far using R's parallel package:
# Parallelization
library(parallel)
n.cores <- detectCores()
clstr <- makePSOCKcluster(n.cores)
# Extract cell values from RasterStack objects
pred.vals <- getValues(pred.rst.stck)
resp.vals <- getValues(resp.rst.stck)
clusterExport(clstr, c("pred.vals", "resp.vals"))
# Loop through all predictor cells
rsq.sums <- parLapply(clstr, seq(nrow(pred.vals)), function(i) {
  # For each predictor cell, loop through all response cells,
  # fit a linear model, extract and sum up the single R-squared values
  do.call("sum", lapply(seq(nrow(resp.vals)), function(j) {
    summary(lm(resp.vals[j, ] ~ pred.vals[i, ]))$r.squared
  }))
})
Although parLapply performs much better than ordinary lapply, I am wondering if there's a more elegant way to speed up the whole procedure. Any suggestions?
Cheers,
Florian
There are a few tricks you can try. I can't quite understand the way you are going about creating the linear models, but for a simple regression with a single predictor, the r.squared you are calculating from your linear model is equal to the square of Pearson's correlation coefficient (cor in R with default arguments), which is far quicker to compute than fitting a linear model.
Compare these two functions using your data:
# Finding r.squared using lm
f1 <- function(n) { summary(lm(resp.vals[n, ] ~ pred.vals[n, ]))$r.squared }
# Finding r.squared using Pearson's correlation coefficient
f2 <- function(n) { cor(resp.vals[n, ], pred.vals[n, ])^2 }
# Both give the same result
f1(1)
[1] 0.0009032986
f2(1)
[1] 0.0009032986
And in terms of timing a single call of these functions:
require(microbenchmark)
microbenchmark( f1(1) , f2(1) )
Unit: microseconds
  expr      min        lq   median        uq      max neval
 f1(1) 2009.328 2037.0680 2071.045 2124.9225 4102.079   100
 f2(1)   73.075   77.7365   84.690   89.7215 5485.368   100
So you should be able to cut your processing time roughly 25-fold by using cor instead of lm.
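To plug this into your parallel loop, the whole inner lapply over lm fits can be collapsed into a single cor call; here's a minimal sketch, assuming the cluster and the pred.vals/resp.vals matrices from your question are still in place:
# Correlate predictor cell i with every response cell's time series
# at once: t(resp.vals) holds one column per response cell
rsq.sums <- parLapply(clstr, seq(nrow(pred.vals)), function(i) {
  sum(cor(t(resp.vals), pred.vals[i, ])^2)
})
Since cor accepts a matrix as its first argument, each worker now makes one call per predictor cell instead of fitting one model per response cell.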
A quick system-time comparison exchanging the original function for cor()^2 shows this to be the case:
# Using cor()
   user  system elapsed
  0.013   0.002   1.085

# Using lm()
   user  system elapsed
  0.159   0.028  26.334
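For reference, timings along those lines can be obtained by wrapping either version of the loop in system.time() (the exact numbers will of course vary by machine); just remember to shut the cluster down afterwards:
# Measure the cor-based loop, then release the PSOCK workers
system.time(
  rsq.sums <- parLapply(clstr, seq(nrow(pred.vals)), function(i) {
    sum(cor(t(resp.vals), pred.vals[i, ])^2)
  })
)
stopCluster(clstr)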