python raster mask shapefile python-xarray

Compute polygon mean from raster: rasterstats.zonal_stats versus rio.clip

I have a raster file and a shapefile containing polygons. For each polygon, I want to compute the mean value of all raster cells within that polygon.

I had been doing this with rasterstats.zonal_stats, which works perfectly but is very slow for large rasters. Recently I came across rioxarray's example of instead clipping the raster to the polygon and then computing the mean. From my experiments, the clipping approach is a lot faster than zonal_stats and the results are the same.

I am therefore wondering if there is anything else except for the time difference that would lead to a preference of one over the other method? And why it is that the clipping is so much faster than the zonal_stats?

Below the output and timing of the two methods, and a snippet of the code. The full code can be found here.

It would be great to get insights on this :)

--- Raster stats ---
    ADM1_EN   mean_adm
0   Central  14.624690
1  Northern   3.950312
2  Southern  20.649534
--- Raster stats: 15.52 seconds ---

--- Rio clip ---
           mean_adm
Central   14.624689
Northern   3.950313
Southern  20.649534
--- Rio clip: 0.28 seconds ---

#load the data
ds=rioxarray.open_rasterio(chirps_path,masked=True)
ds=ds.rio.write_crs("EPSG:4326")
ds_date=ds.sel(time="2020-01-01").squeeze()

#use rasterstats
start_time = time.time()
gdf = gpd.read_file(hdx_adm1_path)[["ADM1_EN", "geometry"]]
gdf["mean_adm"] = pd.DataFrame(
            zonal_stats(
                vectors=gdf,
                raster=ds_date.values,
                affine=ds_date.rio.transform(),
                nodata=np.nan,
                all_touched=False
            )
        )["mean"]
print("--- Raster stats ---")
display(gdf[["ADM1_EN","mean_adm"]])
print(f"--- Raster stats: {(time.time() - start_time):.2f} seconds ---")

#use clipping
start_time = time.time()
gdf = gpd.read_file(hdx_adm1_path)[["ADM1_EN", "geometry"]]
df_adm=pd.DataFrame(index=gdf.ADM1_EN.unique())
for a in gdf.ADM1_EN.unique():
    gdf_adm=gdf[gdf.ADM1_EN==a]

    da_clip = ds_date.rio.set_spatial_dims(
        x_dim="x", y_dim="y"
    ).rio.clip(
        gdf_adm["geometry"], all_touched=False
    )

    grid_mean = da_clip.mean(dim=["x", "y"],skipna=True).rename("mean_adm")
    df_adm.loc[a,"mean_adm"]=grid_mean.values
print("--- Rio clip ---")
display(df_adm)
print(f"--- Rio clip: {(time.time() - start_time):.2f} seconds ---")

Solution

In the view of self-learning, I realized I made a mistake here and I think it is beneficial to report.

It all depends on the order you run the functions in. As I loaded the raster data with rioxarray.open_rasterio, this data is not loaded into memory. Thereafter both the clip and zonal_stats still have to load this into memory. When one of the two is called before the other, the latter makes use of the fact that the data is already loaded.

You can also choose to instead call ds.load() before calling the fuctions. In my tests this didn't change the total computation time.