Search code examples
pythonnumpydaskpython-xarrayzarr

How to use Dask.Array.From_Zarr to open a zarr file on Dask?


I'm having quite a problem when converting a zarr file to a dask array. This is what I get when I type arr = da.from_zarr('gros.zarr/time') : enter image description here

but when I try on one coordinates such as time it works: enter image description here

Any Ideas how to solve this ?


Solution

  • When you read a zarr array in xarray, dask will be enabled by default, unless you specify chunks=None. You absolutely do not have to go through dask.dataframe - you can go straight from xarray.DataArray to dask.Array. In fact, there's not even a copy required - all you need to do is access the .data attribute underlying the DataArray.

    Here's an example from a file I have laying around:

    In [3]: import xarray as xr
       ...: import os
       ...:
       ...: fp = os.path.join(
       ...:     ROOT_DIR,
       ...:     'ScenarioMIP/INM/INM-CM5-0/ssp370/r1i1p1f1/day/tasmax/v1.1.zarr'
       ...: )
       ...: 
       ...: ds = xr.open_zarr(fp)
       ...: ds
    Out[3]:
    <xarray.Dataset>
    Dimensions:  (lat: 720, lon: 1440, time: 31390)
    Coordinates:
      * lat      (lat) float64 -89.88 -89.62 -89.38 -89.12 ... 89.38 89.62 89.88
      * lon      (lon) float64 -179.9 -179.6 -179.4 -179.1 ... 179.4 179.6 179.9
      * time     (time) object 2015-01-01 12:00:00 ... 2100-12-31 12:00:00
    Data variables:
        tasmax   (time, lat, lon) float32 dask.array<chunksize=(365, 360, 360), meta=np.ndarray>
    Attributes: (12/47)
        Conventions:                  CF-1.7 CMIP-6.2
        activity_id:                  ScenarioMIP AerChemMIP
        contact:                      climatesci@rhg.com
        creation_date:                2019-06-17T08:27:21Z
        data_specs_version:           01.00.29
        dc6_bias_correction_method:   Quantile Delta Method (QDM)
        ...                           ...
        sub_experiment_id:            none
        table_id:                     day
        tracking_id:                  hdl:21.14100/da7e759e-3979-42e4-b92f-02e7e2...
        variable_id:                  tasmax
        variant_label:                r1i1p1f1
        version_id:                   v20190618
    

    You can think of xarray Datasets as fancy dictionaries holding DataArrays as objects. DataArrays themselves are just N-dimensional arrays with labeled indices. The data contained in a DataArray is provided by an array "backend", which is usually numpy or dask.Array. When you read in a zarr dataset, the result will be a dask.Array with a bit of extra xarray index & metadata handling on top. We can see that the values in this array are a dask array by inspecting the array preview at the top:

    In [4]: ds.tasmax
    Out[4]:
    <xarray.DataArray 'tasmax' (time: 31390, lat: 720, lon: 1440)>
    dask.array<open_dataset-51b28ad08603ab401a85808d9fa3d6d7tasmax, shape=(31390, 720, 1440), dtype=float32, chunksize=(365, 360, 360), chunktype=numpy.ndarray>
    Coordinates:
      * lat      (lat) float64 -89.88 -89.62 -89.38 -89.12 ... 89.38 89.62 89.88
      * lon      (lon) float64 -179.9 -179.6 -179.4 -179.1 ... 179.4 179.6 179.9
      * time     (time) object 2015-01-01 12:00:00 ... 2100-12-31 12:00:00
    Attributes:
        cell_measures:  area: areacella
        cell_methods:   area: mean time: maximum (interval: 1 day)
        comment:        maximum near-surface (usually, 2 meter) air temperature (...
        coordinates:    height
        history:        2019-06-17T08:27:21Z altered by CMOR: Treated scalar dime...
        long_name:      Daily Maximum Near-Surface Air Temperature
        original_name:  tasmax
        standard_name:  air_temperature
        units:          K
    

    Xarray is a great library which allows you to use pandas-style indexing in an N-dimensional space. But if you want to work with the dask.array directly, you can simply access the .data attribute on a dask-backed xarray DataArray:

    In [5]: ds.tasmax.data
    Out[5]: dask.array<open_dataset-51b28ad08603ab401a85808d9fa3d6d7tasmax, shape=(31390, 720, 1440), dtype=float32, chunksize=(365, 360, 360), chunktype=numpy.ndarray>