How to use Dask.Array.From_Zarr to open a zarr file on Dask?

I'm having quite a problem when converting a zarr file to a dask array. This is what I get when I type arr = da.from_zarr('gros.zarr/time') :

but when I try on one coordinates such as time it works:

Any Ideas how to solve this ?

Solution

When you read a zarr array in xarray, dask will be enabled by default, unless you specify chunks=None. You absolutely do not have to go through dask.dataframe - you can go straight from xarray.DataArray to dask.Array. In fact, there's not even a copy required - all you need to do is access the .data attribute underlying the DataArray.

Here's an example from a file I have laying around:

In [3]: import xarray as xr
   ...: import os
   ...:
   ...: fp = os.path.join(
   ...:     ROOT_DIR,
   ...:     'ScenarioMIP/INM/INM-CM5-0/ssp370/r1i1p1f1/day/tasmax/v1.1.zarr'
   ...: )
   ...: 
   ...: ds = xr.open_zarr(fp)
   ...: ds
Out[3]:
<xarray.Dataset>
Dimensions:  (lat: 720, lon: 1440, time: 31390)
Coordinates:
  * lat      (lat) float64 -89.88 -89.62 -89.38 -89.12 ... 89.38 89.62 89.88
  * lon      (lon) float64 -179.9 -179.6 -179.4 -179.1 ... 179.4 179.6 179.9
  * time     (time) object 2015-01-01 12:00:00 ... 2100-12-31 12:00:00
Data variables:
    tasmax   (time, lat, lon) float32 dask.array<chunksize=(365, 360, 360), meta=np.ndarray>
Attributes: (12/47)
    Conventions:                  CF-1.7 CMIP-6.2
    activity_id:                  ScenarioMIP AerChemMIP
    contact:                      climatesci@rhg.com
    creation_date:                2019-06-17T08:27:21Z
    data_specs_version:           01.00.29
    dc6_bias_correction_method:   Quantile Delta Method (QDM)
    ...                           ...
    sub_experiment_id:            none
    table_id:                     day
    tracking_id:                  hdl:21.14100/da7e759e-3979-42e4-b92f-02e7e2...
    variable_id:                  tasmax
    variant_label:                r1i1p1f1
    version_id:                   v20190618

You can think of xarray Datasets as fancy dictionaries holding DataArrays as objects. DataArrays themselves are just N-dimensional arrays with labeled indices. The data contained in a DataArray is provided by an array "backend", which is usually numpy or dask.Array. When you read in a zarr dataset, the result will be a dask.Array with a bit of extra xarray index & metadata handling on top. We can see that the values in this array are a dask array by inspecting the array preview at the top:

In [4]: ds.tasmax
Out[4]:
<xarray.DataArray 'tasmax' (time: 31390, lat: 720, lon: 1440)>
dask.array<open_dataset-51b28ad08603ab401a85808d9fa3d6d7tasmax, shape=(31390, 720, 1440), dtype=float32, chunksize=(365, 360, 360), chunktype=numpy.ndarray>
Coordinates:
  * lat      (lat) float64 -89.88 -89.62 -89.38 -89.12 ... 89.38 89.62 89.88
  * lon      (lon) float64 -179.9 -179.6 -179.4 -179.1 ... 179.4 179.6 179.9
  * time     (time) object 2015-01-01 12:00:00 ... 2100-12-31 12:00:00
Attributes:
    cell_measures:  area: areacella
    cell_methods:   area: mean time: maximum (interval: 1 day)
    comment:        maximum near-surface (usually, 2 meter) air temperature (...
    coordinates:    height
    history:        2019-06-17T08:27:21Z altered by CMOR: Treated scalar dime...
    long_name:      Daily Maximum Near-Surface Air Temperature
    original_name:  tasmax
    standard_name:  air_temperature
    units:          K

Xarray is a great library which allows you to use pandas-style indexing in an N-dimensional space. But if you want to work with the dask.array directly, you can simply access the .data attribute on a dask-backed xarray DataArray:

In [5]: ds.tasmax.data
Out[5]: dask.array<open_dataset-51b28ad08603ab401a85808d9fa3d6d7tasmax, shape=(31390, 720, 1440), dtype=float32, chunksize=(365, 360, 360), chunktype=numpy.ndarray>