Zonal Statistics R (raster/polygon)

In R, there are deviations calculating the mean in function 'zonal.stats' of package 'spatialEco' compared to 'extract' in package 'raster'. For both I used a polygon as the zone field and a raster for the values.

Here is an example:

library(raster)
library(spatialEco)
library(sp)

#Create raster
ras <- raster(nrows=100, ncols=80, xmn=0, xmx=1000, ymn=0, ymx=800)
val <- runif(ncell(ras))
values(ras) <- val

#Create polygon within raster extent
xym <- cbind(runif(3,0,1000), runif(3,0,800))
p <- Polygons(list(Polygon(xym)),1)
sp <- SpatialPolygons(list(p))
spdf <- SpatialPolygonsDataFrame(sp, data=data.frame(1))

#z1 zonal statistics using "spatialECO"
z1 <- zonal.stats(spdf, ras, stats="mean")

#z2 zonal statistics using "raster"
z2 <- extract(ras, spdf, fun=mean)

What causes the deviation of z2 and z1?

Solution

spatialEco::zonal.stats uses exactextractr (I have not checked the code, but it told me to install it to be able to use zonal.stats) which should be more exact if you are considering polygons (the raster package turns them into rasters first, see zonal below). However, the below example (it is only one case) suggest that spatialEco is less precise.

Example (avoid random numbers, but if you do use them, use set.seed). I start with very large grid cells.

library(raster)
library(spatialEco)

ras <- raster(nrows=4, ncols=4, xmn=0, xmx=1000, ymn=0, ymx=800)
values(ras) <- 1:ncell(ras)
set.seed(1)
xy <- cbind(runif(3,0,1000), runif(3,0,800))
xy <- rbind(xy, xy[1,])
sp <- spPolygons(xy, attr=data.frame(x=1))


### zonal statistics using "spatialECO"
zonal.stats(sp, ras, stats="mean")
#  mean.layer
#1          7
### zonal statistics using "raster"
extract(ras, sp, fun=mean)
#     [,1]
#[1,]    6
### same as 
# x <- rasterize(sp, ras)
# zonal(ras, x, "mean")

With raster you can also get a more precise estimate like this

e <- extract(ras, sp, weights=T)[[1]]
weighted.mean(e[,1], e[,2])
#[1] 5.269565

To see how many cells are used

zonal.stats(sp, ras, stats="counter")
#  counter.layer
#1             6
extract(ras, sp, fun=function(x,...)length(x))
#     [,1]
#[1,]    3

One way to look at this is to create higher resolution raster data.

10x higher resolution, same values

ras <- disaggregate(ras, 10)
zonal.stats(sp, ras, stats="mean")
#  mean.layer
#1        5.5
extract(ras, sp, fun=mean)
#         [,1]
#[1,] 5.245614

zonal.stats(sp, ras, stats="counter")
#  counter.layer
#1           218
extract(ras, sp, fun=function(x,...)length(x))
#     [,1]
#[1,]  171

100x higher resolution, same values

ras <- disaggregate(ras, 10)
zonal.stats(sp, ras, stats="mean")
#mean.layer
#1   5.299915
extract(ras, sp, small=TRUE, fun=mean)
#         [,1]
#[1,] 5.271039


zonal.stats(sp, ras, stats="counter")
# counter.layer
#1         17695

extract(ras, sp, fun=function(x,...)length(x))
#      [,1]
#[1,] 17289

At the highest resolution the mean values are similar (and the relative difference in the number of cells is small); but raster was closer to the correct value (whatever that is, exactly) at a lower resolution (and also with the weighted mean). That is unexpected.

For better speed, there is now also the terra package

library(terra)    
r <- rast(ras)
v <- vect(sp)
extract(r, v, "mean")    
#     ID    layer
#[1,]  1 5.271039