I have multiple raster layers (one for each hour, covering the contiguous US) and a polygon vector layer. I am extracting the average value for each polygon from each raster layer using terra::extract. The process works fine for a limited number of polygons and raster layers, but since I want to scale this up to multiple months with 1000+ polygons, I was wondering if there is a more optimized way to approach this.
The code snippet below shows a minimal reproducible example along with the desired output. I have thought of two possible routes (plus a combination of the two), but have not gotten far implementing them:
1. Parallel processing, but I am not sure how to achieve that without running into serialization issues.
2. Instead of calling extract(..., fun = mean) on every raster, running extract(mrast[[1]], mpoly, cells = TRUE, exact = TRUE) once to get the cell numbers along with the fraction of each cell covered by each polygon. Since all of the raster layers share the same grid, I could then subset those cells for each hour, multiply their values by the coverage fractions, and compute the averages.
3. A combination of 2 and 1, which I think would be the fastest method.
#setwd("C:/temp")
library(terra)
library(R.utils)
murl <- "https://mtarchive.geol.iastate.edu/2023/06/27/mrms/ncep/MultiSensor_QPE_01H_Pass2/"
mfiles <- c("MultiSensor_QPE_01H_Pass2_00.00_20230627-070000.grib2.gz",
"MultiSensor_QPE_01H_Pass2_00.00_20230627-080000.grib2.gz")
## download, unzip, read the raster files to R
lapply(seq_along(mfiles), \(i) download.file(paste0(murl, mfiles[i]), mfiles[i]))
lapply(mfiles, \(f) gunzip(f, gsub(".gz", "", f), remove = TRUE, overwrite=TRUE))
mrast <- lapply(list.files(pattern = ".grib2$"), \(f) rast(f))
lapply(seq_along(mrast), \(i) names(mrast[[i]]) <<- time(mrast[[i]]))
## creating a polygon SpatVector
mpoly <- vect(dptply, "polygon") ## get dptply at the bottom of the post
crs(mpoly) <- "EPSG:3857"
mpoly <- project(mpoly, "+proj=longlat +datum=WGS84")
## extracting the averages (this part needs optimization)
startTime = Sys.time()
mavg <- lapply(mrast, \(r) {op <- extract(r, mpoly,
weights=TRUE, fun=mean, na.rm = TRUE)
message(time(r))
return(op)})
endTime = Sys.time()
print(endTime - startTime)
2023-06-27 07:00:00
2023-06-27 08:00:00
Time difference of 10.87896 secs
## just for demonstration
cbind(mavg[[1]][1], do.call(cbind, lapply(mavg, "[", 2)))
#> ID 2023-06-27 07:00:00 2023-06-27 08:00:00
#> 1 1 0.000000 10.405161
#> 2 2 0.000000 8.961424
#> 3 3 0.000000 6.300000
#> 4 4 0.000000 5.902914
#> 5 5 0.000000 5.462630
#> 6 6 0.000000 4.100000
#> 7 7 0.000000 3.566511
#> 8 8 0.000000 13.275161
#> 9 9 2.600000 11.200000
#> 10 10 4.633752 16.301403
dptply <- structure(c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3,
4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 8,
8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, -7945530.9815, -7938378.0621, -7938605.1389,
-7945190.3663, -7945530.9815, -7938264.5237, -7936674.9861, -7930543.9123,
-7932360.5267, -7936674.9861, -7938264.5237, -7938264.5237, -7934698.4679,
-7933886.647, -7933878.7652, -7934666.9409, -7934698.4679, -7933735.8469,
-7932622.0601, -7932634.9736, -7933745.532, -7933735.8469, -7934646.4483,
-7933925.1099, -7933904.9326, -7934631.3153, -7934646.4483, -7933551.8299,
-7932850.6689, -7932865.8018, -7933496.3424, -7933551.8299, -7931645.765,
-7931670.3955, -7932384.6797, -7932470.8864, -7931645.765, -7940928.6868,
-7941271.2217, -7941328.3108, -7948978.2562, -7949434.9693, -7940928.6868,
-7998760.8958, -7999220.4225, -7998966.0077, -7998636.765, -7998760.8958,
-8001395.2248, -8001358.6877, -8003002.8551, -8003149.0033, -8001395.2248,
5262480.6428, 5262635.0714, 5251676.934, 5251831.1854, 5262480.6428,
5262789.5025, 5268041.6687, 5261862.9539, 5252293.955, 5248438.2361,
5248900.8391, 5262789.5025, 5269309.6397, 5269309.6397, 5269052.1707,
5269041.443, 5269309.6397, 5269446.29, 5269441.8957, 5267934.7856,
5267925.9984, 5269446.29, 5271218.9923, 5271198.3903, 5270580.3524,
5270607.8199, 5271218.9923, 5272036.2399, 5272049.9757, 5271507.4245,
5271239.5943, 5272036.2399, 5272266.345, 5270589.7139, 5270556.1843,
5272216.0417, 5272266.345, 5246883.428, 5237972.8, 5237895.3531,
5238282.5936, 5247503.6111, 5246883.428, 5227512.1703, 5227597.3282,
5228266.5923, 5228286.8738, 5227512.1703, 5232252.738, 5229875.3018,
5229528.6423, 5232104.1307, 5232252.738, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0),
dim = c(53L, 5L),
dimnames = list(NULL, c("geom", "part", "x", "y", "hole")))
You pretty much answered this yourself; here is how I would approach it, using your option 2.
First, let's find which cells (the cell column) are in which polygons (ID) and what fraction of each cell is covered (fraction). This takes time but only needs to be done once.
cells <- terra::extract(mrast[[1]], mpoly, cells=TRUE, exact=TRUE)
head(cells)
# ID 2023-06-27 08:00:00 cell fraction
# 1 1 9.8 8629863 0.2353906
# 2 1 9.6 8629864 0.4162643
# 3 1 9.8 8629865 0.4321375
# 4 1 10.2 8629866 0.4480107
# 5 1 10.3 8629867 0.4638838
# 6 1 10.1 8629868 0.4797569
Using this information, we can extract data for the polygons without dealing with their geometry at all, just using the cell numbers from above. Here is a function which takes a raster (rast) and the data above (cell_data) as its arguments and computes the weighted (let's be exact!) mean for each polygon:
polygon_means <- function(rast, cell_data) {
# values for cells
ex <- terra::extract(rast, cell_data$cell)
# weighted means by polygons
tapply(ex[, 1] * cell_data$fraction, cell_data$ID, sum) /
tapply(cell_data$fraction, cell_data$ID, sum)
}
Now we can do the computation:
names(mrast) <- sapply(mrast, names)
mavg <- sapply(mrast, polygon_means, cells)
mavg
# 2023-06-27 07:00:00 2023-06-27 08:00:00
# 1 0.000000 10.405166
# 2 0.000000 8.959888
# 3 0.000000 6.300000
# 4 0.000000 5.902433
# 5 0.000000 5.462595
# 6 0.000000 4.100000
# 7 0.000000 3.566484
# 8 0.000000 13.275182
# 9 2.600000 11.200000
# 10 4.633801 16.301409
This takes about 3 seconds on my machine, which is about 10 times faster than your original code (yes, 30 seconds on my laptop).
Thinking about your option 1, i.e. parallelizing: its usefulness would probably depend on your computing environment. Since the SpatRaster data is not actually held in RAM and has to be read from disk each time it is accessed, on a single-disk machine the disk I/O might be the main bottleneck, and distributing the computation among multiple processor cores would not help with that.
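If you still want to try it, one way around the serialization issue is to avoid sending SpatRaster objects to the workers at all: export only the small, serializable pieces (the polygon_means function and the cells data frame) plus the file paths, and let each worker open its own raster with terra::rast(). A rough sketch using the parallel package (object names like grib_files and mavg_par are just illustrative, and whether it pays off depends on your disk and core count):

library(parallel)
## the unzipped .grib2 files from the example above
grib_files <- list.files(pattern = ".grib2$", full.names = TRUE)
cl <- makeCluster(2)
## export only small, serializable objects (a function and a data frame)
clusterExport(cl, c("polygon_means", "cells"))
## each worker rebuilds the SpatRaster from the file path,
## so no SpatRaster pointer is ever serialized
mavg_par <- parLapply(cl, grib_files, \(f) polygon_means(terra::rast(f), cells))
stopCluster(cl)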
I don't think there will be much practical difference in how you access the raster values. Under the hood, both extract and [ call some variant of a C++ extraction function (e.g., .extract_cell()). Compare:
selectMethod(terra::extract, c('SpatRaster', 'numeric'))
selectMethod('[', c('SpatRaster', 'numeric', 'numeric'))
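As a quick, illustrative check with the objects defined above, extracting a handful of cell numbers through either interface should return the same values (ids here is just an arbitrary subset):

## both index the raster by cell number and return the same values
ids <- cells$cell[1:5]
terra::extract(mrast[[1]], ids)
mrast[[1]][ids]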