Tags: python, dask, python-xarray, zarr

xarray.Dataset.to_zarr: overwrite data if exists with append_dim


With xarray.Dataset.to_zarr it is possible to write an xarray Dataset to a Zarr store and append new data along a dimension using the append_dim parameter.

However, if the coordinate of the new data along this dimension is already there, the existing data won't be replaced. Instead, the same coordinate appears twice in the resulting dataset.

Example using the data from here:

Here I write two Datasets to the same .zarr store. The datasets are appended along the space dimension. Both datasets contain the same space coordinate "IL".

import numpy as np
import pandas as pd
import xarray as xr

ds_A = xr.DataArray(
    np.random.rand(4, 2),
    [
        ("time", pd.date_range("2000-01-01", periods=4)),
        ("space", ["IA", "IL"]),
    ],
).to_dataset(name="measurements")


ds_B = xr.DataArray(
    np.random.rand(4, 2),
    [
        ("time", pd.date_range("2000-01-01", periods=4)),
        ("space", ["IL", "NY"]),
    ],
).to_dataset(name="measurements")


ds_A.to_zarr("weather.zarr", append_dim="space")
ds_B.to_zarr("weather.zarr", append_dim="space");

When reading the file, the second dataset didn't overwrite the data for the "IL" coordinate, but created a new one:

xr.open_zarr("weather.zarr")


<xarray.Dataset>
Dimensions:       (space: 4, time: 4)
Coordinates:
  * space         (space) <U2 'IA' 'IL' 'IL' 'NY'
  * time          (time) datetime64[ns] 2000-01-01 2000-01-02 ... 2000-01-04
Data variables:
    measurements  (time, space) float64 dask.array<chunksize=(4, 2), meta=np.ndarray>
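
The duplicated entry also shows up in the pandas index backing the space coordinate (a quick check to confirm the behaviour):

xr.open_zarr("weather.zarr").get_index("space").duplicated()

array([False, False,  True, False])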

This would be the desired result:

<xarray.Dataset>
Dimensions:       (space: 3, time: 4)
Coordinates:
  * space         (space) <U2 'IA' 'IL' 'NY'
  * time          (time) datetime64[ns] 2000-01-01 2000-01-02 ... 2000-01-04
Data variables:
    measurements  (time, space) float64 dask.array<chunksize=(3, 2), meta=np.ndarray>

Does anybody know if it is possible to replace the data if the coordinate already exists?


Solution

  • I don't think there's an out-of-the-box way to do this; appending always adds the full dataset to the end.

    However, version 0.16.2 of xarray introduced the keyword region to to_zarr, which lets you write to a limited region of a Zarr store.

    You can use it to overwrite the existing data:

    # write first dataset (mode="w" recreates the store, since the example above already wrote to it)
    ds_A.to_zarr("weather.zarr", mode="w")
    
    # read structure of dataset to see what's on disk
    ds_ondisk = xr.open_zarr('weather.zarr/')
    
    # get index of first new datapoint
    start_ix, = np.nonzero(~np.isin(ds_B.space, ds_ondisk.space))
    
    # region of new data
    region_new = slice(start_ix[0], ds_B.space.size)
    
    # append structure of new data (compute=False means no data is written)
    ds_B.isel(space=region_new).to_zarr("weather.zarr", append_dim='space', compute=False)
    
    # get updated dataset size and create slice
    ds_ondisk = xr.open_zarr('weather.zarr/')
    region_update = slice(start_ix[0], ds_ondisk.space.size)
    
    # write new data into the region (the time index coordinate is outside the region and must be dropped)
    ds_B.drop_vars("time").to_zarr("weather.zarr", region={"space": region_update})
    
    # produces
    xr.open_zarr('weather.zarr/')
    
    <xarray.Dataset>
    Dimensions:       (space: 3, time: 4)
    Coordinates:
      * space         (space) <U2 'IA' 'IL' 'NY'
      * time          (time) datetime64[ns] 2000-01-01 2000-01-02 ... 2000-01-04
    Data variables:
        measurements  (time, space) float64 dask.array<chunksize=(4, 2), meta=np.ndarray>
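
    As a quick sanity check (assuming the ds_A and ds_B objects from above are still in memory), the "IL" column on disk should now hold ds_B's values rather than ds_A's:

    # verify the overlapping coordinate was overwritten with ds_B's data
    result = xr.open_zarr("weather.zarr")
    np.testing.assert_allclose(
        result.measurements.sel(space="IL").values,
        ds_B.measurements.sel(space="IL").values,
    )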