I have a bunch of NetCDF (.nc) files (ERA5 dataset) that I'm reading in Python through xarray and rioxarray. They end up as arrays of float32 (4 bytes) in memory. However, on disk they are stored as short (2 bytes):
$ ncdump -h file.nc
...
    short u100(time, latitude, longitude) ;
        u100:scale_factor = 0.000895262699529722 ;
        u100:add_offset = 2.29252111865024 ;
        u100:_FillValue = -32767s ;
        u100:missing_value = -32767s ;
...
Apparently xarray automatically applies the offset and scale factor to convert these integers back into floats while reading the NetCDF file.
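For reference, this decoding follows the CF packing convention: unpacked = packed * scale_factor + add_offset, with values equal to _FillValue treated as missing. Below is a minimal pure-Python sketch of that formula using the attributes from the header above. The function name unpack is made up for illustration; xarray's own implementation is vectorized and differs in detail.

```python
import math

# Attributes copied from the ncdump header above.
SCALE_FACTOR = 0.000895262699529722
ADD_OFFSET = 2.29252111865024
FILL_VALUE = -32767


def unpack(packed: int) -> float:
    """Convert one stored short to its physical value (hypothetical helper)."""
    if packed == FILL_VALUE:
        return math.nan  # the fill value marks missing data
    return packed * SCALE_FACTOR + ADD_OFFSET


print(unpack(0))           # a stored 0 decodes to the offset itself
print(unpack(1000))        # offset plus 1000 quantization steps
print(unpack(FILL_VALUE))  # nan
```

Adjacent stored integers differ by exactly scale_factor in physical units, which is why the float32 arrays carry no more information than the original shorts.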
Now I'm rechunking these and storing them as zarr, so I can efficiently access entire time series at a single geographical location. However, the zarr files end up at almost twice the size of the original NetCDFs, because the data remain stored as floats. Because it's about a terabyte in its original form, bandwidth and storage considerations are important, so I'd like to make this smaller. And we're not gaining anything by this additional storage size; the incoming data only had 16 bits of precision to begin with.
I know I could just manually convert the data back to shorts on the way into zarr, and back to floats on the way out of zarr, but that's tedious and error-prone (even if I automate it).
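To make the manual route concrete, here is a rough pure-Python sketch of what it involves. It assumes the common recipe of mapping the data range linearly onto the int16 range while reserving -32767 for missing values, as in the header above. All function names and sample values are hypothetical, and real code would use vectorized NumPy operations instead of lists.

```python
import math

INT16_MAX = 32767
FILL_VALUE = -32767          # same sentinel as in the ncdump header
PACKED_MIN = FILL_VALUE + 1  # smallest integer used for real data


def choose_packing(vmin: float, vmax: float) -> tuple[float, float]:
    """Return (scale_factor, add_offset) mapping [vmin, vmax] onto the packed range."""
    steps = INT16_MAX - PACKED_MIN
    # Constant data would give a zero scale; fall back to 1.0 in that case.
    scale = (vmax - vmin) / steps if vmax > vmin else 1.0
    offset = vmin - PACKED_MIN * scale
    return scale, offset


def pack(value: float, scale: float, offset: float) -> int:
    """Quantize one physical value to its stored integer."""
    if math.isnan(value):
        return FILL_VALUE
    return round((value - offset) / scale)


def unpack(packed: int, scale: float, offset: float) -> float:
    """Invert pack(), up to the quantization error."""
    if packed == FILL_VALUE:
        return math.nan
    return packed * scale + offset


values = [-12.5, 0.0, 3.25, math.nan, 27.75]  # made-up sample data
finite = [v for v in values if not math.isnan(v)]
scale, offset = choose_packing(min(finite), max(finite))
packed = [pack(v, scale, offset) for v in values]
restored = [unpack(p, scale, offset) for p in packed]
print(packed)
print(restored)
```

The round trip is lossy by at most half a quantization step (scale / 2), and every reader has to know the scale, the offset and the fill value. That bookkeeping is the error-prone part.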
Is there a way to do this transparently, the way it seems to happen with NetCDF?
I had been writing with the zarr package directly, which doesn't seem to support this. But xarray does, through its encoding argument!
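>>> import xarray as xr
>>> ds = xr.Dataset(data_vars={'my_var': xr.DataArray([1.0, 1.5, 2.0])})
>>> # The encoding key must match the variable name ('my_var').
>>> ds.to_zarr(
...     '/tmp/test.zarr',
...     encoding={
...         'my_var': dict(
...             scale_factor=0.1,  # stored integer = value / scale_factor
...             dtype='int16',     # on-disk data type
...         )
...     })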
The zarr on disk ends up with the right format, scaling and attributes:
>>> import zarr
>>> my_var = zarr.open('/tmp/test.zarr')['my_var']
>>> my_var.info
Name : /my_var
Type : zarr.core.Array
Data type : int16
Shape : (3,)
Chunk shape : (3,)
Order : C
Read-only : False
Compressor : Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
Store type : zarr.storage.DirectoryStore
No. bytes : 6
No. bytes stored : 411
Storage ratio : 0.0
Chunks initialized : 1/1
>>> dict(my_var.attrs)
{'_ARRAY_DIMENSIONS': ['dim_0'], 'scale_factor': 0.1}
>>> my_var[:]
array([10, 15, 20], dtype=int16)
When opening the dataset, we have to use xarray as well, so that the scaling is applied on read. mask_and_scale=True is xarray's default, but it is passed explicitly here:
>>> xr.open_dataset('/tmp/test.zarr', engine='zarr', mask_and_scale=True)['my_var'].values
array([1. , 1.5, 2. ], dtype=float32)
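For the ERA5 case from the question, one option is to reuse the packing that xarray recorded when it read the NetCDF files instead of choosing a new scale_factor by hand. The sketch below assumes that each variable's .encoding dict still holds the source file's dtype, scale_factor, add_offset and _FillValue, which is what xarray normally records for NetCDF input. The helper names, the path and the chunk sizes are hypothetical, and this has not been run against real ERA5 data.

```python
# Keys that describe the packed on-disk representation. Other entries that
# xarray records for a NetCDF source (chunk sizes, compression settings,
# source file name, ...) are specific to the original file and are left out.
PACKING_KEYS = ("dtype", "scale_factor", "add_offset", "_FillValue")


def packing_encoding(source_encoding: dict) -> dict:
    """Keep only the packing-related entries of a variable's encoding."""
    return {k: v for k, v in source_encoding.items() if k in PACKING_KEYS}


def write_packed_zarr(ds, path, chunks):
    """Rechunk `ds` and write it to zarr using each variable's original packing.

    `ds` is assumed to be an xarray.Dataset opened from the NetCDF files.
    """
    encoding = {
        name: packing_encoding(var.encoding)
        for name, var in ds.data_vars.items()
    }
    ds.chunk(chunks).to_zarr(path, encoding=encoding)


# Hypothetical usage, with made-up path and chunk sizes:
# write_packed_zarr(ds, "/tmp/era5.zarr",
#                   {"time": -1, "latitude": 10, "longitude": 10})
```

One caveat: packing parameters can differ from file to file, and a dataset combined from many files only carries one set of them. Before relying on this, it is worth checking that the full data range still fits into int16 with the chosen scale_factor and add_offset.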