Search code examples
rr-rasterterra

Terra R - Speed up aggregate() of raster data with custom function


I would like to use aggregate function from the terra R package to aggregate raster with a quantiles approach as aggregation function. Here below, I used the quantile function from R base to compute 50th percentile (i.e. the median) using a raster from the local package directory. I chose the 50th percentile for comparison with median but my goal is indeed to compute other quantile(s)...

library(terra)
# load elevation coming with the terra pakage
r <- rast( system.file("ex/elev.tif", package="terra") )

plot(r)

# number of iteration
n_it <- 20

# with a custom function
start_time <- Sys.time()
for (i in 1:n_it){
ra <- aggregate(r, 2 , fun = function(x) quantile(x, probs = .5, na.rm = T))
}
end_time <- Sys.time()

It took my computer approx. 6 secs to do it 20 times.

print(end_time-start_time)

Time difference of 6.052727 secs

WHen I ran the same aggregate run with the median built-in function, it took approx. 40 times less time to perform the very same 20 iterations!

# with a built-in function
start_time <- Sys.time()
for (i in 1:n_it){
  ra <- aggregate(r, 2 , fun = median)
}
end_time <- Sys.time()
print(end_time-start_time)

Time difference of 0.1456101 secs

As I would like to compute other percentiles than the 50th, could someone provide some advises to speed up aggregate when using custom functions?


Solution

  • Based on the replies, I tested two options: the use of tdigest package and the built-in parallelization routine from terra package (cores parameter). The t-Digest construction algorithm, by Dunning et al., (2019), uses a variant of 1-dimensional k-means clustering to produce a very compact data structure that allows accurate estimation of quantiles. I recommend to use the tquantile function, which reduced by third the processing time with the tested dataset.

    For those who were thinking about foreach parallelization, there are no simple solution to run foreach loop with terra objects. For such tasks, I'm still using the good old raster package. It is a planned update but not on short term - see here. More details below.

    Toy dataset

    library(terra)
    # load elevation coming with the terra pakage
    r <- rast( system.file("ex/elev.tif", package="terra") )
    
    plot(r)
    
    # number of iteration
    n_it <- 20
    
    # With `stats::quantile()` function
    start_time <- Sys.time()
    for (i in 1:n_it){
      ra <- aggregate(r, 2 , fun = function(x) quantile(x, probs = .5, na.rm = T))
    }
    end_time <- Sys.time()
    print(end_time-start_time)
    

    Time difference of 6.013551 secs

    With tdigest::tquantile()

    library(tdigest)
    
    start_time_tdigest <- Sys.time()
    for (i in 1:n_it){
      ra_tdigest <- aggregate(r, 2 , fun = function(x) tquantile(tdigest(na.omit(x)), probs = .5))
      }
    end_time_tdigest <- Sys.time()
    print(end_time_tdigest-start_time_tdigest)
    

    Time difference of 1.922526 secs

    As suspected by Martin, the use of the cores parameter in the terra:aggregate function did not improve the processing time:

    stats::quantile() + parallelization

    start_time_parallel <- Sys.time() 
    for (i in 1:n_it){
    ra_tdigest_parallel <- aggregate(r, 2 , fun = function(x) quantile(x, probs = .5, na.rm = T), cores = 2) 
    } 
    end_time_parallel <- Sys.time() 
    print(end_time_parallel-start_time_parallel)
    

    Time difference of 8.537751 secs

    tdigest::tquantile() + parallelization

    tdigest_quantil_terra <- function(x) {   
    require(tdigest)   
    tquantile(tdigest(na.omit(x)), probs = .5) }
            
    start_time_tdigest_parallel <- Sys.time() for (i in 1:n_it){
    ra_tdigest_parallel <- aggregate(r, 2 , 
    fun = function(x, ff) ff(x), cores = 2 , 
    ff = tdigest_quantil_terra)     
    } 
    
    end_time_tdigest_parallel <- Sys.time() 
    print(end_time_tdigest_parallel-start_time_tdigest_parallel)
    

    Time difference of 7.520231 secs

    In a nutshell:

    1 tdigest 1.922526 secs

    2 base_quantile 6.013551 secs

    3 tdigest_parallel 7.520231 secs

    4 base_quantile_parallel 8.537751 secs