Tags: python · dask · python-xarray · zarr

Problems with chunksize (Dask, xarray, zarr)


I want to save an xarray.dataset as a .zarr file, but I cannot configure my chunks to be uniform and it will not save.

I have tried:

changing the chunk size when calling xarray.open_mfdataset -> it still uses automatic chunks, which do not work.

changing the chunk size with dataset.chunk(n) -> the dataset still uses the automatic chunks from when it was opened.

CODE:

import xarray as xr
import glob
import zarr

local_dir = "/directory/"
data_dir = local_dir + 'folder/'

files = glob.glob(data_dir + '*.nc')
n = 1320123
data_files = xr.open_mfdataset(files,concat_dim='TIME',chunks={'TIME': n}) # does not specify chunks, uses automatic chunks
data_files.chunk(n) # try modifying here, still uses automatic chunks
data_files.to_zarr(store=data_dir + 'test.zarr',mode='w') # I get an error about non-uniform chunks - see below

ValueError: Zarr requires uniform chunk sizes except for final chunk. Variable dask chunks ((1143410, 512447, 1170473, 281220, 852819),) are incompatible. Consider rechunking using chunk().

I expect the .zarr file to save with the new chunks, but it falls back to the original automatic chunk sizes.


Solution

  • Xarray's Dataset.chunk method returns a new dataset, so you would need something more like:

    ds = xr.open_mfdataset(files, concat_dim='TIME').chunk({'TIME': n})
    ds.to_zarr(...)
    

    A few other details to note:

    • Why the chunks kwarg to open_mfdataset doesn't behave as desired: currently, chunks along the concat_dim are fixed to the length of the data in each file. I suspect this is also why you have irregular chunk sizes.

    • open_mfdataset will do the glob for you. This is a minor time saver, but something to note for the future: you can just call xr.open_mfdataset('/directory/folder/*.nc', ...).
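
    Both points can be seen on a small synthetic dataset (a sketch, standing in for the real NetCDF files): concatenating dask-backed datasets of different lengths mimics how open_mfdataset produces one chunk per file, and rechunking the result gives the uniform chunks Zarr requires. Note that .chunk() returns a new object rather than modifying in place.

    ```python
    import numpy as np
    import xarray as xr

    # Two datasets of different lengths, one chunk each -- mimics
    # open_mfdataset creating one chunk per input file
    a = xr.Dataset({"v": ("TIME", np.arange(4))}).chunk({"TIME": 4})
    b = xr.Dataset({"v": ("TIME", np.arange(7))}).chunk({"TIME": 7})

    ds = xr.concat([a, b], dim="TIME")
    print(ds["v"].data.chunks)  # ((4, 7),) -- non-uniform, to_zarr would fail

    # .chunk() returns a NEW dataset with uniform chunks (last may be smaller)
    rechunked = ds.chunk({"TIME": 5})
    print(rechunked["v"].data.chunks)  # ((5, 5, 1),) -- acceptable for Zarr

    # The original is untouched, which is why ds.chunk(n) alone had no effect
    print(ds["v"].data.chunks)  # still ((4, 7),)
    ```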