
DataArray deletes Attributes in simple computation


I've noticed that if you have an xarray DataArray and perform even a simple calculation on it, the attributes get 'deleted'.

Example:

import numpy as np
import xarray as xr
example            = xr.DataArray(np.array([1,2,3]), attrs={'one':1})
without_Attributes = example*3

On the other hand, if you use NumPy-style methods (e.g. .round(x)), the attributes remain. Is there a reasonable explanation for this? And is there a way to multiply the DataArray without losing its attributes?


Solution

  • From the xarray docs on "what is your approach to metadata?":

    We are firm believers in the power of labeled data! In addition to dimensions and coordinates, xarray supports arbitrary metadata in the form of global (Dataset) and variable specific (DataArray) attributes (attrs).

    Automatic interpretation of labels is powerful but also reduces flexibility. With xarray, we draw a firm line between labels that the library understands (dims and coords) and labels for users and user code (attrs). For example, we do not automatically interpret and enforce units or CF conventions. (An exception is serialization to and from netCDF files.)

    An implication of this choice is that we do not propagate attrs through most operations unless explicitly flagged (some methods have a keep_attrs option, and there is a global flag for setting this to be always True or False). Similarly, xarray does not check for conflicts between attrs when combining arrays and datasets, unless explicitly requested with the option compat='identical'. The guiding principle is that metadata should not be allowed to get in the way.
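
The per-method flag mentioned in the quote can be sketched as follows. This is a minimal illustration using mean as the method; keep_attrs is passed explicitly in both calls so the outcome does not depend on whatever default the installed xarray version uses.

```python
import numpy as np
import xarray as xr

example = xr.DataArray(np.array([1, 2, 3]), attrs={"one": 1})

# Ask the reduction to carry the attributes over to its result.
kept = example.mean(keep_attrs=True)

# Ask the same reduction to discard them.
dropped = example.mean(keep_attrs=False)

print(kept.attrs)
print(dropped.attrs)
```

The flag only affects the one call it is passed to, which makes it a narrower tool than a global option.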

    You can set global options in xarray with xr.set_options:

    In [14]: xr.set_options(keep_attrs=True)
    Out[14]: <xarray.core.options.set_options at 0x133ef58e0>
    

    Now attributes are preserved:

    In [15]: example * 3
    Out[15]:
    <xarray.DataArray (dim_0: 3)>
    array([3, 6, 9])
    Dimensions without coordinates: dim_0
    Attributes:
        one:      1
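
If changing the option for the whole session feels too broad, xr.set_options can also be used as a context manager, so the setting only applies inside the with block. A small sketch, again setting the flag explicitly in both directions rather than relying on the default:

```python
import numpy as np
import xarray as xr

example = xr.DataArray(np.array([1, 2, 3]), attrs={"one": 1})

# Attributes are carried through arithmetic inside this block only.
with xr.set_options(keep_attrs=True):
    kept = example * 3

# Here they are explicitly discarded.
with xr.set_options(keep_attrs=False):
    dropped = example * 3

print(kept.attrs)
print(dropped.attrs)
```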
    

    Note that xarray does not do anything "smart" with these attributes, which is why the default behavior is to drop them in computation. For instance, a simple case with units shows how setting keep_attrs=True can go off the rails:

    In [17]: dist = xr.DataArray(np.array([1,2,3]), attrs={'units': 'm'})
        ...: dist
    Out[17]:
    <xarray.DataArray (dim_0: 3)>
    array([1, 2, 3])
    Dimensions without coordinates: dim_0
    Attributes:
        units:    m
    
    In [18]: rate = xr.DataArray(np.array([2, 2, 2]), attrs={'units': 'm/s'})
        ...: rate
    Out[18]:
    <xarray.DataArray (dim_0: 3)>
    array([2, 2, 2])
    Dimensions without coordinates: dim_0
    Attributes:
        units:    m/s
    
    In [19]: dist / rate
    Out[19]:
    <xarray.DataArray (dim_0: 3)>
    array([0.5, 1. , 1.5])
    Dimensions without coordinates: dim_0
    Attributes:
        units:    m
    

    The result above is labeled units: m, carried over from dist, even though a distance divided by a rate is a time in seconds.

    If you want to explicitly handle units in computation with xarray, have a look at pint-xarray, which is an effort to integrate the pint project's explicit unit handling with xarray. This project is experimental and the API is not stable, but there has been considerable work lately by both the pint-xarray crew and xarray's core team to move in the same direction, so I don't expect this coordination to go away.

    Workaround (or maybe the best of all worlds?)

    Note that since Dataset and DataArray attributes are simply dictionaries, preserving them is easy:

    In [22]: result = example * 3
        ...: result.attrs.update(example.attrs)
    
    In [23]: result
    Out[23]:
    <xarray.DataArray (dim_0: 3)>
    array([3, 6, 9])
    Dimensions without coordinates: dim_0
    Attributes:
        one:      1
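
The same copy can be written as a single expression with assign_attrs, which returns a new object with the given attributes added. A sketch; the extra note key is purely illustrative and not part of the original example:

```python
import numpy as np
import xarray as xr

example = xr.DataArray(np.array([1, 2, 3]), attrs={"one": 1})

# Compute, then attach the source attributes in one chained expression.
result = (example * 3).assign_attrs(example.attrs)

# Extra keyword arguments are added alongside the copied attributes.
tagged = (example * 3).assign_attrs(example.attrs, note="scaled by 3")

print(result.attrs)
print(tagged.attrs)
```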
    

    You can even work with them independently of the DataArray or Dataset:

    
    In [25]: ds = xr.open_dataset('my_well_documented_file.nc')
    
    In [26]: source_attrs = ds.attrs
    
    In [27]: result = xr.Dataset({'new_var': ds.varname * 3})
    
    In [28]: result.attrs.update(
        ...:     # custom new attrs (assumes pandas is imported as pd)
        ...:     method='multiplied varname by 3',
        ...:     updated=pd.Timestamp.now(tz='US/Pacific').strftime('%c'),
        ...:     # carry forward attrs from input file
        ...:     **{k: source_attrs[k] for k in ['author', 'contact']},
        ...: )
    
    

    So the approach I generally take is to explicitly copy over the attributes I want at the end of computation. And, if desired, you can handle units explicitly with pint-xarray and then carry forward other metadata as a dictionary.
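
For anyone who wants to try this pattern without a netCDF file at hand, here is a self-contained sketch. The in-memory dataset stands in for the opened file, and every attribute value in it is made up for illustration; the standard library's datetime is used for the timestamp instead of pandas.

```python
from datetime import datetime, timezone

import numpy as np
import xarray as xr

# Stand-in for a dataset opened from a well-documented file (values are invented).
ds = xr.Dataset(
    {"varname": ("x", np.array([1.0, 2.0, 3.0]))},
    attrs={"author": "A. Author", "contact": "author@example.org", "history": "raw"},
)

# Attributes are a plain dictionary, so they can be held separately.
source_attrs = ds.attrs

result = xr.Dataset({"new_var": ds["varname"] * 3})

result.attrs.update(
    # custom new attrs
    method="multiplied varname by 3",
    updated=datetime.now(timezone.utc).isoformat(),
    # carry forward only the selected attrs from the input
    **{k: source_attrs[k] for k in ("author", "contact")},
)

print(result.attrs)
```

Only the keys that were named explicitly end up on the result, so stale metadata such as the history entry here is left behind.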