I'm opening a zarr file, rechunking it, and writing it back out to a different zarr store. Yet when I open the new store, it doesn't respect the chunk size I wrote. Here is the code and the output from Jupyter. Any idea what I'm doing wrong here?
bathy_ds = xr.open_zarr('data/bathy_store')
bathy_ds.elevation
bathy_ds.chunk(5000).elevation
bathy_ds.chunk(5000).to_zarr('data/elevation_store')
new_ds = xr.open_zarr('data/elevation_store')
new_ds.elevation
It reverts to the original chunking, as if I'm not fully overwriting it or am missing some other setting that needs changing.
This seems to be a known issue, and there's a fair bit of discussion going on within the issue's thread and a recently merged PR.
Basically, the dataset carries the original chunking around in the .encoding property. So when you call the second write operation, the chunks defined in ds[var].encoding['chunks'] (if present) will be used to write var to zarr.
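To see the mismatch before writing, one can compare the chunking recorded in .encoding with the dask chunking currently held in memory. The sketch below is only an illustration: stale_chunk_encoding is a hypothetical helper, not part of xarray. It assumes each variable exposes an .encoding dict and a .chunks tuple of per-dimension block sizes, as xarray variables do, and it uses simple stand-in objects so that it runs without a zarr store.

```python
from types import SimpleNamespace


def stale_chunk_encoding(variables):
    """Return {name: (encoded_chunks, current_chunks)} for variables whose
    encoding['chunks'] disagrees with their current in-memory chunking.

    `variables` is any mapping of name -> object with an `.encoding` dict
    and a `.chunks` attribute (a tuple of per-dimension block-size tuples,
    or None when the data is not chunked). With xarray, `ds.variables` is
    assumed to fit this shape.
    """
    stale = {}
    for name, var in variables.items():
        encoded = var.encoding.get("chunks")
        if encoded is None or var.chunks is None:
            continue  # nothing recorded, or not chunked in memory
        # Use the first block along each dimension as the nominal chunk size.
        current = tuple(blocks[0] for blocks in var.chunks)
        if tuple(encoded) != current:
            stale[name] = (tuple(encoded), current)
    return stale


# Stand-in for a variable that was opened with chunks of 500 along time
# and then rechunked to a single chunk along time (2920 steps).
air = SimpleNamespace(
    encoding={"chunks": (500, 25, 53)},
    chunks=((2920,), (25,), (53,)),
)
lat = SimpleNamespace(encoding={}, chunks=None)

result = stale_chunk_encoding({"air": air, "lat": lat})
print(result)
```

With a real dataset, the call would presumably be stale_chunk_encoding(ds.variables); any variable it reports would be written with the old chunking unless its encoding is cleared first.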
According to the conversation in the GH issue, the best current solution is to manually delete the chunk encoding for the variables in question:
for var in ds:
    del ds[var].encoding['chunks']
However, it should be noted that this seems to be an evolving situation, so it would be good to check in on the progress and adopt the final solution once there is one.
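A slightly more defensive variant of that loop is sketched here. It uses dict.pop so that variables without a 'chunks' entry do not raise a KeyError, and it walks a mapping of all variables rather than only the data variables, on the assumption that coordinates may carry chunk encoding too. drop_chunk_encoding is a hypothetical name, and the stand-in objects only mimic the .encoding attribute of xarray variables so that the snippet runs on its own.

```python
from types import SimpleNamespace


def drop_chunk_encoding(variables):
    """Remove encoding['chunks'] in place and return the names that changed.

    `variables` is any mapping of name -> object with an `.encoding` dict;
    with xarray, `ds.variables` is assumed to fit this shape.
    """
    changed = []
    for name, var in variables.items():
        # pop() with a default tolerates variables that have no chunk encoding.
        if var.encoding.pop("chunks", None) is not None:
            changed.append(name)
    return changed


# Stand-ins: one variable with stale chunk encoding, one without any.
air = SimpleNamespace(encoding={"chunks": (500, 25, 53), "dtype": "float32"})
lat = SimpleNamespace(encoding={})

changed = drop_chunk_encoding({"air": air, "lat": lat})
print(changed)
print(air.encoding)
```

Other encoding keys (the dtype here) are left untouched, so only the chunk layout falls back to whatever the dataset is chunked as at write time.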
Here's a little example that showcases the issue and solution:
import xarray as xr
# load data and write to initial chunking
x = xr.tutorial.load_dataset("air_temperature")
x.chunk({"time":500, "lat":-1, "lon":-1}).to_zarr("zarr1.zarr")
# display initial chunking
xr.open_zarr("zarr1.zarr/").air
# rechunk
y = xr.open_zarr("zarr1.zarr/").chunk({"time": -1})
# display
y.air
# write without modifying .encoding
y.to_zarr("zarr2.zarr")
# display
xr.open_zarr("zarr2.zarr/").air
# delete encoding and store
del y.air.encoding['chunks']
y.to_zarr("zarr3.zarr")
# display
xr.open_zarr("zarr3.zarr/").air