I am reading a ZARR file from a s3 bucket with xarray. I got to successfully filter by time and latitude/longitude:
def read_zarr(self, dataset: str, region: Region) -> Any:
# Read ZARR from s3 bucket
fs = s3fs.S3FileSystem(key="KEY", secret="SECRET")
mapper = fs.get_mapper(f"{self.S3_PATH}{dataset}")
zarr_ds = xr.open_zarr(mapper, decode_times=True)
# Filter by time
time_period = pd.date_range("2013-01-01", "2023-01-31")
zarr_ds = zarr_ds.sel(time=time_period)
# Filter by latitude/longitude
region_gdf = region.geo_data_frame
latitude_slice = slice(region_gdf.bounds.miny[0], region_gdf.bounds.maxy[0])
longitude_slice = slice(region_gdf.bounds.minx[0], region_gdf.bounds.maxx[0])
return zarr_ds.sel(latitude=latitude_slice, longitude=longitude_slice)
The problem is that this returns a rectangle of data (actually a cuboid, if we consider the time dimension). For geographical regions that are long and thin, this will represent a huge waste, as I will first download years of data, to then discard most of it. Example with California:
I would like to intersect the ZARR coordinates with the region ones. How can I achieve it?
One option is to use rioxarray
and then clip to your region, very similar to this thread.
You will need to set the coordinate system. Assuming your data are given as standard lat/lon, this is epsg 4326 (see opening sentences here). You may need to use region_gdf.to_crs('epsg:4326')
to make sure that the coordinate systems are the same (or set_crs
if it doesn't have a coordinate reference system already).
Once this formatting is done, you can use the rio.clip
method and set drop=True
to only keep the points in the region.
New lines below - it might not work exactly as written because I can't test it, but I think it should get you close.
import rioxarray
zarr_ds.rio.set_spatial_dims(x_dim="lon", y_dim="lat", inplace=True)
zarr_ds.rio.write_crs("epsg:4326", inplace=True)
zarr_ds_clipped = zarr_ds.rio.clip(region_gdf, drop=True)