Search code examples
rstatisticspolygonrasterspatial

Zonal Statistics R (raster/polygon)


In R, there are deviations calculating the mean in function 'zonal.stats' of package 'spatialEco' compared to 'extract' in package 'raster'. For both I used a polygon as the zone field and a raster for the values.

Here is an example:

library(raster)
library(spatialEco)
library(sp)

#Create raster
ras <- raster(nrows=100, ncols=80, xmn=0, xmx=1000, ymn=0, ymx=800)
val <- runif(ncell(ras))
values(ras) <- val

#Create polygon within raster extent
xym <- cbind(runif(3,0,1000), runif(3,0,800))
p <- Polygons(list(Polygon(xym)),1)
sp <- SpatialPolygons(list(p))
spdf <- SpatialPolygonsDataFrame(sp, data=data.frame(1))

#z1 zonal statistics using "spatialECO"
z1 <- zonal.stats(spdf, ras, stats="mean")

#z2 zonal statistics using "raster"
z2 <- extract(ras, spdf, fun=mean)

What causes the deviation of z2 and z1?


Solution

  • spatialEco::zonal.stats uses exactextractr (I have not checked the code, but it told me to install it to be able to use zonal.stats) which should be more exact if you are considering polygons (the raster package turns them into rasters first, see zonal below). However, the below example (it is only one case) suggest that spatialEco is less precise.

    Example (avoid random numbers, but if you do use them, use set.seed). I start with very large grid cells.

    library(raster)
    library(spatialEco)
    
    ras <- raster(nrows=4, ncols=4, xmn=0, xmx=1000, ymn=0, ymx=800)
    values(ras) <- 1:ncell(ras)
    set.seed(1)
    xy <- cbind(runif(3,0,1000), runif(3,0,800))
    xy <- rbind(xy, xy[1,])
    sp <- spPolygons(xy, attr=data.frame(x=1))
    
    
    ### zonal statistics using "spatialECO"
    zonal.stats(sp, ras, stats="mean")
    #  mean.layer
    #1          7
    ### zonal statistics using "raster"
    extract(ras, sp, fun=mean)
    #     [,1]
    #[1,]    6
    ### same as 
    # x <- rasterize(sp, ras)
    # zonal(ras, x, "mean")
    

    With raster you can also get a more precise estimate like this

    e <- extract(ras, sp, weights=T)[[1]]
    weighted.mean(e[,1], e[,2])
    #[1] 5.269565
    

    To see how many cells are used

    zonal.stats(sp, ras, stats="counter")
    #  counter.layer
    #1             6
    extract(ras, sp, fun=function(x,...)length(x))
    #     [,1]
    #[1,]    3
    

    One way to look at this is to create higher resolution raster data.

    10x higher resolution, same values

    ras <- disaggregate(ras, 10)
    zonal.stats(sp, ras, stats="mean")
    #  mean.layer
    #1        5.5
    extract(ras, sp, fun=mean)
    #         [,1]
    #[1,] 5.245614
    
    zonal.stats(sp, ras, stats="counter")
    #  counter.layer
    #1           218
    extract(ras, sp, fun=function(x,...)length(x))
    #     [,1]
    #[1,]  171
    

    100x higher resolution, same values

    ras <- disaggregate(ras, 10)
    zonal.stats(sp, ras, stats="mean")
    #mean.layer
    #1   5.299915
    extract(ras, sp, small=TRUE, fun=mean)
    #         [,1]
    #[1,] 5.271039
    
    
    zonal.stats(sp, ras, stats="counter")
    # counter.layer
    #1         17695
    
    extract(ras, sp, fun=function(x,...)length(x))
    #      [,1]
    #[1,] 17289
    

    At the highest resolution the mean values are similar (and the relative difference in the number of cells is small); but raster was closer to the correct value (whatever that is, exactly) at a lower resolution (and also with the weighted mean). That is unexpected.

    For better speed, there is now also the terra package

    library(terra)    
    r <- rast(ras)
    v <- vect(sp)
    extract(r, v, "mean")    
    #     ID    layer
    #[1,]  1 5.271039