I would like to use aggregate
function from the terra
R
package to aggregate raster with a quantiles approach as aggregation function. Here below, I used the quantile
function from R base
to compute 50th percentile (i.e. the median) using a raster from the local package directory. I chose the 50th percentile for comparison with median but my goal is indeed to compute other quantile(s)...
library(terra)
# load elevation coming with the terra pakage
r <- rast( system.file("ex/elev.tif", package="terra") )
plot(r)
# number of iteration
n_it <- 20
# with a custom function
start_time <- Sys.time()
for (i in 1:n_it){
ra <- aggregate(r, 2 , fun = function(x) quantile(x, probs = .5, na.rm = T))
}
end_time <- Sys.time()
It took my computer approx. 6 secs to do it 20 times.
print(end_time-start_time)
Time difference of 6.052727 secs
WHen I ran the same aggregate
run with the median built-in function, it took approx. 40 times less time to perform the very same 20 iterations!
# with a built-in function
start_time <- Sys.time()
for (i in 1:n_it){
ra <- aggregate(r, 2 , fun = median)
}
end_time <- Sys.time()
print(end_time-start_time)
Time difference of 0.1456101 secs
As I would like to compute other percentiles than the 50th, could someone provide some advises to speed up aggregate
when using custom functions?
Based on the replies, I tested two options: the use of tdigest
package and the built-in parallelization routine from terra
package (cores
parameter). The t-Digest construction algorithm, by Dunning et al., (2019), uses a variant of 1-dimensional k-means clustering to produce a very compact data structure that allows accurate estimation of quantiles. I recommend to use the tquantile
function, which reduced by third the processing time with the tested dataset.
For those who were thinking about foreach
parallelization, there are no simple solution to run foreach loop with terra objects. For such tasks, I'm still using the good old raster package. It is a planned update but not on short term - see here. More details below.
library(terra)
# load elevation coming with the terra pakage
r <- rast( system.file("ex/elev.tif", package="terra") )
plot(r)
# number of iteration
n_it <- 20
# With `stats::quantile()` function
start_time <- Sys.time()
for (i in 1:n_it){
ra <- aggregate(r, 2 , fun = function(x) quantile(x, probs = .5, na.rm = T))
}
end_time <- Sys.time()
print(end_time-start_time)
Time difference of 6.013551 secs
tdigest::tquantile()
library(tdigest)
start_time_tdigest <- Sys.time()
for (i in 1:n_it){
ra_tdigest <- aggregate(r, 2 , fun = function(x) tquantile(tdigest(na.omit(x)), probs = .5))
}
end_time_tdigest <- Sys.time()
print(end_time_tdigest-start_time_tdigest)
Time difference of 1.922526 secs
As suspected by Martin, the use of the cores
parameter in the terra:aggregate
function did not improve the processing time:
stats::quantile()
+ parallelizationstart_time_parallel <- Sys.time()
for (i in 1:n_it){
ra_tdigest_parallel <- aggregate(r, 2 , fun = function(x) quantile(x, probs = .5, na.rm = T), cores = 2)
}
end_time_parallel <- Sys.time()
print(end_time_parallel-start_time_parallel)
Time difference of 8.537751 secs
tdigest::tquantile()
+ parallelizationtdigest_quantil_terra <- function(x) {
require(tdigest)
tquantile(tdigest(na.omit(x)), probs = .5) }
start_time_tdigest_parallel <- Sys.time() for (i in 1:n_it){
ra_tdigest_parallel <- aggregate(r, 2 ,
fun = function(x, ff) ff(x), cores = 2 ,
ff = tdigest_quantil_terra)
}
end_time_tdigest_parallel <- Sys.time()
print(end_time_tdigest_parallel-start_time_tdigest_parallel)
Time difference of 7.520231 secs
In a nutshell:
1 tdigest 1.922526 secs
2 base_quantile 6.013551 secs
3 tdigest_parallel 7.520231 secs
4 base_quantile_parallel 8.537751 secs