I have multiple raster layers (one for each hour, covering the contiguous US) and a polygon vector layer. I am extracting the average value for each polygon from each raster layer using terra::extract. The process works fine for a limited number of polygons and raster layers, but since I want to scale this up to multiple months with 1000+ polygons, I was wondering if there is a more optimized way to approach this.
The code snippet below shows a minimal reproducible example along with the desired output. I have thought of two possible routes (plus a combination of the two), but have not gotten far implementing them:
1. Parallel processing, but I am not sure how to achieve that without running into serialization issues.
2. Instead of calling extract(..., fun = mean) on every raster, running extract(mrast[[1]], mpoly, cells = TRUE, exact = TRUE) once to get the cell numbers along with the fraction of each cell covered by each polygon. Since all of the raster layers share the same grid, I could then subset those cells for each hour, multiply their values by the coverage fractions, and compute the averages.
3. A combination of 2 and 1, which I think would be the fastest method.
#setwd("C:/temp")
library(terra)
library(R.utils)
murl <- "https://mtarchive.geol.iastate.edu/2023/06/27/mrms/ncep/MultiSensor_QPE_01H_Pass2/"
mfiles <- c("MultiSensor_QPE_01H_Pass2_00.00_20230627-070000.grib2.gz",
"MultiSensor_QPE_01H_Pass2_00.00_20230627-080000.grib2.gz")
## download, unzip, read the raster files to R
lapply(seq_along(mfiles), \(i) download.file(paste0(murl, mfiles[i]), mfiles[i]))
lapply(mfiles, \(f) gunzip(f, gsub(".gz", "", f), remove = TRUE, overwrite=TRUE))
mrast <- lapply(list.files(pattern = ".grib2$"), \(f) rast(f))
lapply(seq_along(mrast), \(i) names(mrast[[i]]) <<- time(mrast[[i]]))
## creating a polygon SpatVector
mpoly <- vect(dptply, "polygon") ## get dptply at the bottom of the post
crs(mpoly) <- "EPSG:3857"
mpoly <- project(mpoly, "+proj=longlat +datum=WGS84")
## extracting the averages (this part needs optimization)
startTime = Sys.time()
mavg <- lapply(mrast, \(r) {op <- extract(r, mpoly,
weights=TRUE, fun=mean, na.rm = TRUE)
message(time(r))
return(op)})
endTime = Sys.time()
print(endTime - startTime)
2023-06-27 07:00:00
2023-06-27 08:00:00
Time difference of 10.87896 secs
## just for demonstration
cbind(mavg[[1]][1], do.call(cbind, lapply(mavg, "[", 2)))
#> ID 2023-06-27 07:00:00 2023-06-27 08:00:00
#> 1 1 0.000000 10.405161
#> 2 2 0.000000 8.961424
#> 3 3 0.000000 6.300000
#> 4 4 0.000000 5.902914
#> 5 5 0.000000 5.462630
#> 6 6 0.000000 4.100000
#> 7 7 0.000000 3.566511
#> 8 8 0.000000 13.275161
#> 9 9 2.600000 11.200000
#> 10 10 4.633752 16.301403
dptply <- structure(c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3,
4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 8,
8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, -7945530.9815, -7938378.0621, -7938605.1389,
-7945190.3663, -7945530.9815, -7938264.5237, -7936674.9861, -7930543.9123,
-7932360.5267, -7936674.9861, -7938264.5237, -7938264.5237, -7934698.4679,
-7933886.647, -7933878.7652, -7934666.9409, -7934698.4679, -7933735.8469,
-7932622.0601, -7932634.9736, -7933745.532, -7933735.8469, -7934646.4483,
-7933925.1099, -7933904.9326, -7934631.3153, -7934646.4483, -7933551.8299,
-7932850.6689, -7932865.8018, -7933496.3424, -7933551.8299, -7931645.765,
-7931670.3955, -7932384.6797, -7932470.8864, -7931645.765, -7940928.6868,
-7941271.2217, -7941328.3108, -7948978.2562, -7949434.9693, -7940928.6868,
-7998760.8958, -7999220.4225, -7998966.0077, -7998636.765, -7998760.8958,
-8001395.2248, -8001358.6877, -8003002.8551, -8003149.0033, -8001395.2248,
5262480.6428, 5262635.0714, 5251676.934, 5251831.1854, 5262480.6428,
5262789.5025, 5268041.6687, 5261862.9539, 5252293.955, 5248438.2361,
5248900.8391, 5262789.5025, 5269309.6397, 5269309.6397, 5269052.1707,
5269041.443, 5269309.6397, 5269446.29, 5269441.8957, 5267934.7856,
5267925.9984, 5269446.29, 5271218.9923, 5271198.3903, 5270580.3524,
5270607.8199, 5271218.9923, 5272036.2399, 5272049.9757, 5271507.4245,
5271239.5943, 5272036.2399, 5272266.345, 5270589.7139, 5270556.1843,
5272216.0417, 5272266.345, 5246883.428, 5237972.8, 5237895.3531,
5238282.5936, 5247503.6111, 5246883.428, 5227512.1703, 5227597.3282,
5228266.5923, 5228286.8738, 5227512.1703, 5232252.738, 5229875.3018,
5229528.6423, 5232104.1307, 5232252.738, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0),
dim = c(53L, 5L),
dimnames = list(NULL, c("geom", "part", "x", "y", "hole")))
You pretty much answered this yourself; here is how I would approach it, using your option 2.
First, let's find which cells (the cell column) are in which polygons (ID) and what fraction of each cell is covered (fraction). This takes time but only needs to be done once.
cells <- terra::extract(mrast[[1]], mpoly, cells=TRUE, exact=TRUE)
head(cells)
# ID 2023-06-27 08:00:00 cell fraction
# 1 1 9.8 8629863 0.2353906
# 2 1 9.6 8629864 0.4162643
# 3 1 9.8 8629865 0.4321375
# 4 1 10.2 8629866 0.4480107
# 5 1 10.3 8629867 0.4638838
# 6 1 10.1 8629868 0.4797569
Using this information, we can extract data for the polygons without dealing with their geometry at all, just using the cell numbers from above. Here is a function which takes a raster (rast) and the data above (cell_data) as its arguments and computes the weighted (let's be exact!) mean for each polygon:
polygon_means <- function(rast, cell_data) {
# values for cells
ex <- terra::extract(rast, cell_data$cell)
# weighted means by polygons
tapply(ex[, 1] * cell_data$fraction, cell_data$ID, sum) /
tapply(cell_data$fraction, cell_data$ID, sum)
}
Now we can do the computation:
names(mrast) <- sapply(mrast, names)
mavg <- sapply(mrast, polygon_means, cells)
mavg
# 2023-06-27 07:00:00 2023-06-27 08:00:00
# 1 0.000000 10.405166
# 2 0.000000 8.959888
# 3 0.000000 6.300000
# 4 0.000000 5.902433
# 5 0.000000 5.462595
# 6 0.000000 4.100000
# 7 0.000000 3.566484
# 8 0.000000 13.275182
# 9 2.600000 11.200000
# 10 4.633801 16.301409
This takes about 3 seconds on my machine, which is about 10 times faster than your original code (yes, 30 seconds on my laptop).
Thinking about your option 1, i.e. parallelizing: its usefulness would probably depend on your computing environment. Since the SpatRaster data is not actually held in RAM and has to be read from disk each time it is accessed, on a single-disk machine the disk I/O might be the main bottleneck, and distributing the computation among multiple processor cores would not help with that.
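If you still want to try it, one way around the serialization issue is to avoid sending SpatRaster objects to the workers at all: export only the small, serializable pieces (the polygon_means function and the cells data frame) plus the file paths, and let each worker open its own raster with terra::rast(). A rough sketch using the parallel package (object names like grib_files and mavg_par are just illustrative, and whether it pays off depends on your disk and core count):

library(parallel)
## the unzipped .grib2 files from the example above
grib_files <- list.files(pattern = ".grib2$", full.names = TRUE)
cl <- makeCluster(2)
## export only small, serializable objects (a function and a data frame)
clusterExport(cl, c("polygon_means", "cells"))
## each worker rebuilds the SpatRaster from the file path,
## so no SpatRaster pointer is ever serialized
mavg_par <- parLapply(cl, grib_files, \(f) polygon_means(terra::rast(f), cells))
stopCluster(cl)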
I don't think there will be much practical difference in how you access the raster values. Under the hood, both extract and [ call some variant of a C++ extraction function (e.g., .extract_cell()). Compare:
selectMethod(terra::extract, c('SpatRaster', 'numeric'))
selectMethod('[', c('SpatRaster', 'numeric', 'numeric'))
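As a quick, illustrative check with the objects defined above, extracting a handful of cell numbers through either interface should return the same values (ids here is just an arbitrary subset):

## both index the raster by cell number and return the same values
ids <- cells$cell[1:5]
terra::extract(mrast[[1]], ids)
mrast[[1]][ids]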