I'm having trouble converting a zarr file to a dask array. This is what I get when I type arr = da.from_zarr('gros.zarr/time'):
But when I try it on a single coordinate, such as time, it works:
Any ideas on how to solve this?
When you read a zarr array in xarray, dask will be enabled by default, unless you specify chunks=None. You absolutely do not have to go through dask.dataframe: you can go straight from xarray.DataArray to dask.Array. In fact, there's not even a copy required; all you need to do is access the .data attribute underlying the DataArray.
Here's an example from a file I have lying around:
In [3]: import xarray as xr
   ...: import os
   ...:
   ...: fp = os.path.join(
   ...:     ROOT_DIR,
   ...:     'ScenarioMIP/INM/INM-CM5-0/ssp370/r1i1p1f1/day/tasmax/v1.1.zarr'
   ...: )
   ...:
   ...: ds = xr.open_zarr(fp)
   ...: ds
Out[3]:
<xarray.Dataset>
Dimensions: (lat: 720, lon: 1440, time: 31390)
Coordinates:
* lat (lat) float64 -89.88 -89.62 -89.38 -89.12 ... 89.38 89.62 89.88
* lon (lon) float64 -179.9 -179.6 -179.4 -179.1 ... 179.4 179.6 179.9
* time (time) object 2015-01-01 12:00:00 ... 2100-12-31 12:00:00
Data variables:
tasmax (time, lat, lon) float32 dask.array<chunksize=(365, 360, 360), meta=np.ndarray>
Attributes: (12/47)
Conventions: CF-1.7 CMIP-6.2
activity_id: ScenarioMIP AerChemMIP
contact: climatesci@rhg.com
creation_date: 2019-06-17T08:27:21Z
data_specs_version: 01.00.29
dc6_bias_correction_method: Quantile Delta Method (QDM)
... ...
sub_experiment_id: none
table_id: day
tracking_id: hdl:21.14100/da7e759e-3979-42e4-b92f-02e7e2...
variable_id: tasmax
variant_label: r1i1p1f1
version_id: v20190618
You can think of xarray Datasets as fancy dictionaries holding DataArrays as values. DataArrays themselves are just N-dimensional arrays with labeled indices. The data contained in a DataArray is provided by an array "backend", which is usually numpy or dask.Array. When you read in a zarr dataset, the result will be a dask.Array with a bit of extra xarray index & metadata handling on top. We can see that the values in this array are a dask array by inspecting the array preview at the top:
In [4]: ds.tasmax
Out[4]:
<xarray.DataArray 'tasmax' (time: 31390, lat: 720, lon: 1440)>
dask.array<open_dataset-51b28ad08603ab401a85808d9fa3d6d7tasmax, shape=(31390, 720, 1440), dtype=float32, chunksize=(365, 360, 360), chunktype=numpy.ndarray>
Coordinates:
* lat (lat) float64 -89.88 -89.62 -89.38 -89.12 ... 89.38 89.62 89.88
* lon (lon) float64 -179.9 -179.6 -179.4 -179.1 ... 179.4 179.6 179.9
* time (time) object 2015-01-01 12:00:00 ... 2100-12-31 12:00:00
Attributes:
cell_measures: area: areacella
cell_methods: area: mean time: maximum (interval: 1 day)
comment: maximum near-surface (usually, 2 meter) air temperature (...
coordinates: height
history: 2019-06-17T08:27:21Z altered by CMOR: Treated scalar dime...
long_name: Daily Maximum Near-Surface Air Temperature
original_name: tasmax
standard_name: air_temperature
units: K
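To make the "fancy dictionary" and "backend" ideas concrete without needing a zarr store, here is a minimal sketch using a small in-memory dataset. The name ds_toy and its values are made up for illustration; only the variable name tasmax is borrowed from the example above.

```python
import dask.array as da
import numpy as np
import xarray as xr

# Hypothetical toy dataset standing in for the zarr store above.
values = np.arange(24, dtype="float32").reshape(2, 3, 4)
ds_toy = xr.Dataset(
    {"tasmax": (("time", "lat", "lon"), values)},
    coords={
        "time": [0, 1],
        "lat": [-10.0, 0.0, 10.0],
        "lon": [0.0, 90.0, 180.0, 270.0],
    },
)

# Dictionary-style access on a Dataset returns a DataArray.
tasmax = ds_toy["tasmax"]

# Built from a numpy array, so the backend is numpy.
print(type(tasmax.data))

# Chunking swaps the backend for a lazy dask array; the labels are unchanged.
chunked = tasmax.chunk({"time": 1})
print(type(chunked.data))
print(chunked.data.chunks)
```

The same DataArray interface sits on top of either backend, which is why the preview for a zarr-backed variable shows dask.array<...> instead of the actual values.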
Xarray is a great library which allows you to use pandas-style indexing in an N-dimensional space. But if you want to work with the dask.array directly, you can simply access the .data attribute on a dask-backed xarray DataArray:
In [5]: ds.tasmax.data
Out[5]: dask.array<open_dataset-51b28ad08603ab401a85808d9fa3d6d7tasmax, shape=(31390, 720, 1440), dtype=float32, chunksize=(365, 360, 360), chunktype=numpy.ndarray>
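Once you have the dask array, you can use the regular dask.array API on it. The following is a rough sketch, again with a small made-up DataArray in place of ds.tasmax, showing that operations stay lazy until you call .compute():

```python
import dask.array as da
import numpy as np
import xarray as xr

# Hypothetical stand-in for ds.tasmax: a small dask-backed DataArray.
tasmax = xr.DataArray(
    np.arange(24, dtype="float32").reshape(2, 3, 4),
    dims=("time", "lat", "lon"),
    name="tasmax",
).chunk({"time": 1})

# .data hands back the underlying dask array without computing anything.
arr = tasmax.data
print(isinstance(arr, da.Array))

# Operations on the dask array only build a task graph.
lazy_mean = arr.mean(axis=0)

# .compute() runs the graph and returns a numpy array.
result = lazy_mean.compute()
print(result.shape)
```

Note that going through .data drops the coordinate labels, so reductions take positional axis arguments (axis=0) rather than dimension names (dim="time").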