Tags: r, parallel-processing, raster

Speed-optimized loop through single cells of a RasterStack object


I'm working with two RasterStack objects, each one consisting of ten layers representing single time steps.

library(raster)

# Mock data
pred.rst.stck <- do.call("stack", lapply(seq(10), function(i) {
  pred.rst <- raster(nrows = 15, ncols = 15, xmn = 0, xmx = 10, ymn = 0, ymx = 10)
  pred.rst[] <- rnorm(225, 50, 10)
  return(pred.rst)
}))
resp.rst.stck <- do.call("stack", lapply(seq(10), function(i) {
  resp.rst <- raster(nrows = 10, ncols = 10, xmn = 0, xmx = 10, ymn = 0, ymx = 10)
  resp.rst[] <- rnorm(100, 50, 10)
  return(resp.rst)
}))

pred.rst.stck serves as a set of predictor variables and resp.rst.stck as a set of response variables. For each single cell of the predictor RasterStack, I'd like to fit a linear model against each cell of the response RasterStack, extract the R-squared of each fitted model, and sum those values up. To cut a long story short, here's my fastest approach so far using R's parallel package:

# Parallelization
library(parallel)
n.cores <- detectCores()
clstr <- makePSOCKcluster(n.cores)

# Extract cell values from RasterStack objects
pred.vals <- getValues(pred.rst.stck)
resp.vals <- getValues(resp.rst.stck)
clusterExport(clstr, c("pred.vals", "resp.vals"))

# Loop through all predictor cells
rsq.sums <- parLapply(clstr, seq(nrow(pred.vals)), function(i) {
  # For each predictor cell, loop through all response cells, 
  # fit linear model, extract and sum up single R-squared 
  do.call("sum", lapply(seq(nrow(resp.vals)), function(j) {
    summary(lm(resp.vals[j, ] ~ pred.vals[i, ]))$r.squared
  }))
})

Although parLapply performs much better than an ordinary lapply, I'm wondering whether there's an elegant way to speed up the whole procedure even further. Any suggestions?

Cheers,
Florian


Solution

  • There are a few tricks you can try. I can't quite understand the way you are going about creating linear models, but the r.squared you are calculating from your linear model is equivalent to the square of Pearson's correlation coefficient (cor in R with its default arguments), which is far quicker to compute than fitting a linear model.

    Compare these two functions using your data:

    # Finding r.squared using lm()
    f1 <- function(n) {summary(lm(resp.vals[n, ] ~ pred.vals[n, ]))$r.squared}
    # Finding r.squared using Pearson's correlation
    f2 <- function(n) {cor(resp.vals[n, ], pred.vals[n, ])^2}
    
    # Both give the same result
    f1(1)
    [1] 0.0009032986
    f2(1)
    [1] 0.0009032986
    

    And in terms of timing a single call of these functions:

    require(microbenchmark)
    microbenchmark( f1(1) , f2(1) )
    Unit: microseconds
      expr      min        lq   median        uq      max neval
     f1(1) 2009.328 2037.0680 2071.045 2124.9225 4102.079   100
     f2(1)   73.075   77.7365   84.690   89.7215 5485.368   100
    

    So you should be able to cut your processing time 25-fold by using cor instead of lm.
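
    As a minimal sketch (untested, and reusing the clstr, pred.vals and resp.vals objects from your question), the cor-based calculation can be dropped straight into your parLapply loop, swapping only the lm call for cor:

    # Same parallel loop as before, but with cor()^2 instead of lm()
    rsq.sums <- parLapply(clstr, seq(nrow(pred.vals)), function(i) {
      # For each predictor cell, sum the squared Pearson correlations
      # with every response cell
      sum(sapply(seq(nrow(resp.vals)), function(j) {
        cor(resp.vals[j, ], pred.vals[i, ])^2
      }))
    })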

    A quick system-time comparison exchanging the original function for cor()^2 shows this to be the case:

    # Using cor()
       user  system elapsed 
      0.013   0.002   1.085
    
    # Using lm()
       user  system elapsed 
      0.159   0.028  26.334
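
    If you want to rerun such a comparison on your own data, wrapping each variant in system.time() is all that's needed (the exact figures will of course depend on your hardware). A small sketch, again reusing the objects from your question, and remembering to shut the cluster down once you're finished:

    # Time one run of the cor-based loop, then release the worker processes
    system.time(
      parLapply(clstr, seq(nrow(pred.vals)), function(i) {
        sum(sapply(seq(nrow(resp.vals)), function(j) {
          cor(resp.vals[j, ], pred.vals[i, ])^2
        }))
      })
    )
    stopCluster(clstr)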