
Do zarr arrays natively support integer scaling and offsets like NetCDF? If not, is there a workaround?


I have a bunch of NetCDF (.nc) files (ERA5 dataset) that I'm reading in Python through xarray and rioxarray. They end up as arrays of float32 (4 bytes) in memory.

However, on disk they are stored as short (2 bytes):

$ ncdump -h file.nc
...
    short u100(time, latitude, longitude) ;
        u100:scale_factor = 0.000895262699529722 ;
        u100:add_offset = 2.29252111865024 ;
        u100:_FillValue = -32767s ;
        u100:missing_value = -32767s ;
...

Apparently xarray automatically applies the offset and scale factor to convert these integers back into floats while reading the NetCDF file.
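For reference, the conversion being applied here is the usual packed-data rule: unpacked = packed * scale_factor + add_offset, with the fill value treated as missing. A minimal pure-Python sketch of that rule for a single value, using the attribute values from the header above (the `decode` helper is only an illustration, not what xarray runs internally):

```python
import math

# Attribute values copied from the ncdump header above.
SCALE_FACTOR = 0.000895262699529722
ADD_OFFSET = 2.29252111865024
FILL_VALUE = -32767


def decode(packed: int) -> float:
    """Unpack one stored short: the fill value becomes NaN, anything else is scaled."""
    if packed == FILL_VALUE:
        return math.nan
    return packed * SCALE_FACTOR + ADD_OFFSET


print(decode(0))           # a stored 0 decodes to add_offset
print(decode(FILL_VALUE))  # missing data decodes to nan
```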

Now I'm rechunking these and storing them as zarr, so I can efficiently access entire time series at a single geographical location. However, the zarr files end up at almost twice the size of the original NetCDFs, because the data remain stored as floats. Because it's about a terabyte in its original form, bandwidth and storage considerations are important, so I'd like to make this smaller. And we're not gaining anything by this additional storage size; the incoming data only had 16 bits of precision to begin with.

I know I could manually convert the data back to shorts on the way into zarr, and back to floats on the way out, but that's tedious and error-prone (even if I automate it).

Is there a way to do this transparently, the way it seems to happen with NetCDF?


Solution

  • I had been writing with the zarr package directly, which doesn't seem to support this. But xarray does, through its encoding argument!

    >>> import xarray as xr
    >>> ds = xr.Dataset(data_vars={'my_var': xr.DataArray([1.0, 1.5, 2.0])})
    >>> ds.to_zarr(
    ...     '/tmp/test.zarr',
    ...     encoding={
    ...         'my_var': dict(  # key must match the variable name in ds
    ...             scale_factor=0.1,
    ...             dtype='int16',
    ...         )
    ...     })
    

    The zarr array on disk ends up with the right dtype, scaled values and attributes:

    >>> import zarr
    >>> my_var = zarr.open('/tmp/test.zarr')['my_var']
    >>> my_var.info
    Name               : /my_var
    Type               : zarr.core.Array
    Data type          : int16
    Shape              : (3,)
    Chunk shape        : (3,)
    Order              : C
    Read-only          : False
    Compressor         : Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
    Store type         : zarr.storage.DirectoryStore
    No. bytes          : 6
    No. bytes stored   : 411
    Storage ratio      : 0.0
    Chunks initialized : 1/1
    >>> dict(my_var.attrs)
    {'_ARRAY_DIMENSIONS': ['dim_0'], 'scale_factor': 0.1}
    >>> my_var[:]
    array([10, 15, 20], dtype=int16)
    

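    When opening the dataset, we have to use xarray as well, so that the scaling is applied on read. This is controlled by mask_and_scale, which is passed explicitly here (as far as I know it already defaults to True):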

    >>> xr.open_dataset('/tmp/test.zarr', mask_and_scale=True)['my_var'].values
    array([1. , 1.5, 2. ], dtype=float32)
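The example above only sets scale_factor. For real data such as the ERA5 variable in the question, you would probably also want add_offset and _FillValue in the encoding. Below is a rough sketch of one common way to derive them from a known value range. The helper name `int16_packing` and the choice of -32768 as the fill value are my own assumptions, not something xarray provides:

```python
def int16_packing(vmin: float, vmax: float) -> dict:
    """Hypothetical helper: build an encoding dict that maps [vmin, vmax] onto int16."""
    fill_value = -32768   # assumption: reserve the lowest short for missing data
    n_steps = 2 * 32767   # packed values then run from -32767 to 32767
    span = vmax - vmin
    # Constant data has zero span; any nonzero scale works in that case.
    scale_factor = span / n_steps if span > 0 else 1.0
    add_offset = (vmax + vmin) / 2  # the midpoint maps to packed value 0
    return {
        "dtype": "int16",
        "scale_factor": scale_factor,
        "add_offset": add_offset,
        "_FillValue": fill_value,
    }


enc = int16_packing(-20.0, 30.0)
# Worst-case rounding error of the packed representation is half a step.
max_error = enc["scale_factor"] / 2
print(enc, max_error)

# Assumed usage, with 'my_var' as in the example above:
# ds.to_zarr('/tmp/test.zarr', encoding={'my_var': enc})
```

Note that vmin and vmax have to cover the whole dataset before writing; values outside that range will not fit into the int16 range, so this is only safe when the range is known up front.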