Search code examples
pythongeojsongeopandaspython-xarrayzarr

Filter xarray ZARR dataset with GeoDataFrame


I am reading a ZARR file from a s3 bucket with xarray. I got to successfully filter by time and latitude/longitude:

    def read_zarr(self, dataset: str, region: Region) -> Any:
        # Read ZARR from s3 bucket
        fs = s3fs.S3FileSystem(key="KEY", secret="SECRET")
        mapper = fs.get_mapper(f"{self.S3_PATH}{dataset}")
        zarr_ds = xr.open_zarr(mapper, decode_times=True)

        # Filter by time
        time_period = pd.date_range("2013-01-01", "2023-01-31")
        zarr_ds = zarr_ds.sel(time=time_period)

        # Filter by latitude/longitude
        region_gdf = region.geo_data_frame
        latitude_slice = slice(region_gdf.bounds.miny[0], region_gdf.bounds.maxy[0])
        longitude_slice = slice(region_gdf.bounds.minx[0], region_gdf.bounds.maxx[0])
        return zarr_ds.sel(latitude=latitude_slice, longitude=longitude_slice)

The problem is that this returns a rectangle of data (actually a cuboid, if we consider the time dimension). For geographical regions that are long and thin, this will represent a huge waste, as I will first download years of data, to then discard most of it. Example with California:

enter image description here

I would like to intersect the ZARR coordinates with the region ones. How can I achieve it?


Solution

  • One option is to use rioxarray and then clip to your region, very similar to this thread.

    You will need to set the coordinate system. Assuming your data are given as standard lat/lon, this is epsg 4326 (see opening sentences here). You may need to use region_gdf.to_crs('epsg:4326') to make sure that the coordinate systems are the same (or set_crs if it doesn't have a coordinate reference system already).

    Once this formatting is done, you can use the rio.clip method and set drop=True to only keep the points in the region.

    New lines below - it might not work exactly as written because I can't test it, but I think it should get you close.

    import rioxarray 
    zarr_ds.rio.set_spatial_dims(x_dim="lon", y_dim="lat", inplace=True)
    zarr_ds.rio.write_crs("epsg:4326", inplace=True)
    zarr_ds_clipped = zarr_ds.rio.clip(region_gdf, drop=True)