Search code examples
pythondifferencepython-xarraydimensions

Subtract two xarrays while keeping all dimensions


this might be the most basic question out there but I just could not find a solution.

I have two different xarrays containing wind data. Both xarrays have the dimensions (time: 60, plev: 19, lat: 90). I now need to take the difference between the two xarrays over all dimensions to find the anomaly between the two scenarios.

I think that the xarray.DataArray.diff function is only for calculating the difference along an axis of one xarray (and not to calculate the difference between two xarrays).

So, I tried using simply

diff = wind1_xarray - wind2_xarray

as well as

diff = (wind1_xarray - wind2_xarray).compute() 

However, both methods give me an xarray with dimensions (time: 60, plev: 0, lat: 90). Why do I loose the pressure levels when calculating the difference?

And how can I calculate the difference between two xarrays over all dimensions without loosing one dimension?

Thanks everyone


Solution

  • The quick answer is that you're doing it right, but your dimensions are not aligned. xarray IS designed to subtract entire arrays, but the coordinate labels must be aligned exactly. You likely have a disagreement between elements of your plev coordinate, which you can check with xr.align:

    xr.align(wind1_array, wind2_array, join='exact')
    

    See the xarray docs on computation: automatic alignment for more information.

    Detailed example

    The biggest difference between xarray and numpy (assuming you're familiar with math using numpy) is that xarray relies on the coordinate labels along each dimension to align arrays before any broadcasting operations, not just the shape.

    As an example, let's consider two very simple arrays - one counting up from 0 to 19 and the other a block of ones, both reshaped to (4, 5). Subtracting these from each other in numpy is straightforward because they're the same shape:

    In [15]: arr1 = np.arange(20).reshape((4, 5))
    
    In [16]: arr2 = np.ones(shape=(4, 5))
    
    In [17]: arr1 - arr2
    Out[17]:
    array([[-1.,  0.,  1.,  2.,  3.],
           [ 4.,  5.,  6.,  7.,  8.],
           [ 9., 10., 11., 12., 13.],
           [14., 15., 16., 17., 18.]])
    

    The xarray equivalent is also straightforward, but we must introduce dimension names and coordinates. Let's assume your pressure levels are in increments of 10 hPa decreasing toward STP, and latitudes are also in increments of 10 from 20 to 60:

    In [18]: pressures = np.array([71.325, 81.325, 91.325, 101.325])
    
    In [19]: lats = np.array([20, 30, 40, 50, 60])
    
    In [20]: da1 = xr.DataArray(arr1, dims=['plev', 'lat'], coords=[pressures, lats])
    
    In [21]: da2 = xr.DataArray(arr2, dims=['plev', 'lat'], coords=[pressures, lats])
    
    In [22]: da2
    Out[22]:
    <xarray.DataArray (plev: 4, lat: 5)>
    array([[1., 1., 1., 1., 1.],
           [1., 1., 1., 1., 1.],
           [1., 1., 1., 1., 1.],
           [1., 1., 1., 1., 1.]])
    Coordinates:
      * plev     (plev) float64 71.33 81.33 91.33 101.3
      * lat      (lat) int64 20 30 40 50 60
    
    In [23]: da1
    Out[23]:
    <xarray.DataArray (plev: 4, lat: 5)>
    array([[ 0,  1,  2,  3,  4],
           [ 5,  6,  7,  8,  9],
           [10, 11, 12, 13, 14],
           [15, 16, 17, 18, 19]])
    Coordinates:
      * plev     (plev) float64 71.33 81.33 91.33 101.3
      * lat      (lat) int64 20 30 40 50 60
    

    These arrays are aligned, so subtracting them is straightforward:

    In [24]: da1 - da2
    Out[24]:
    <xarray.DataArray (plev: 4, lat: 5)>
    array([[-1.,  0.,  1.,  2.,  3.],
           [ 4.,  5.,  6.,  7.,  8.],
           [ 9., 10., 11., 12., 13.],
           [14., 15., 16., 17., 18.]])
    Coordinates:
      * plev     (plev) float64 71.33 81.33 91.33 101.3
      * lat      (lat) int64 20 30 40 50 60
    

    But because xarray relies on these coordinates being aligned exactly, relying on float coordinates can be tricky. If we introduce even a small error to the pressure level dimension, the arrays are not aligned and we see a result similar to yours:

    In [25]: da2 = xr.DataArray(arr2, dims=['plev', 'lat'], coords=[pressures + 1e-8, lats])
    
    In [26]: da1 - da2
    Out[26]:
    <xarray.DataArray (plev: 0, lat: 5)>
    array([], shape=(0, 5), dtype=float64)
    Coordinates:
      * plev     (plev) float64
      * lat      (lat) int64 20 30 40 50 60
    

    This type of mis-alignment can happen for all sorts of reasons, including round-tripping data through storage, where changes in encoding can cause tiny numerical errors which show up as mis-aligned data.

    You can check whether this is the problem with xr.align using the join='exact' argument:

    In [27]: xr.align(da1, da2, join='exact')
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-29-612460e52308> in <module>
    ----> 1 xr.align(da1, da2, join='exact')
    
    ~/miniconda3/envs/myenv/lib/python3.9/site-packages/xarray/core/alignment.py in align(join, copy, indexes, exclude, fill_value, *objects)
        320             ):
        321                 if join == "exact":
    --> 322                     raise ValueError(f"indexes along dimension {dim!r} are not equal")
        323                 joiner = _get_joiner(join, type(matching_indexes[0]))
        324                 index = joiner(matching_indexes)
    
    ValueError: indexes along dimension 'plev' are not equal
    

    To get around this problem, you might try rounding your coordinates to a known tolerance of the coordinate:

    In [32]: da2['plev'] = np.round(da2['plev'], 3)
    
    In [33]: da1 - da2
    Out[33]:
    <xarray.DataArray (plev: 4, lat: 5)>
    array([[-1.,  0.,  1.,  2.,  3.],
           [ 4.,  5.,  6.,  7.,  8.],
           [ 9., 10., 11., 12., 13.],
           [14., 15., 16., 17., 18.]])
    Coordinates:
      * plev     (plev) float64 71.33 81.33 91.33 101.3
      * lat      (lat) int64 20 30 40 50 60
    

    Alternatively, you could set a positional/integer coordinate, with the actual pressure level as a non-indexing coordinate:

    In [42]: da1
    Out[42]:
    <xarray.DataArray (plev_ind: 4, lat: 5)>
    array([[ 0,  1,  2,  3,  4],
           [ 5,  6,  7,  8,  9],
           [10, 11, 12, 13, 14],
           [15, 16, 17, 18, 19]])
    Coordinates:
        plev      (plev_ind) float64 71.33 81.33 91.33 101.3
      * lat       (lat) int64 20 30 40 50 60
      * plev_ind  (plev_ind) int64 71325 81325 91325 101325
    
    In [43]: da2
    Out[43]:
    <xarray.DataArray (plev_ind: 4, lat: 5)>
    array([[1., 1., 1., 1., 1.],
           [1., 1., 1., 1., 1.],
           [1., 1., 1., 1., 1.],
           [1., 1., 1., 1., 1.]])
    Coordinates:
        plev      (plev_ind) float64 71.33 81.33 91.33 101.3
      * lat       (lat) int64 20 30 40 50 60
      * plev_ind  (plev_ind) int64 71325 81325 91325 101325
    
    In [44]: da1 - da2
    Out[44]:
    <xarray.DataArray (plev_ind: 4, lat: 5)>
    array([[-1.,  0.,  1.,  2.,  3.],
           [ 4.,  5.,  6.,  7.,  8.],
           [ 9., 10., 11., 12., 13.],
           [14., 15., 16., 17., 18.]])
    Coordinates:
      * lat       (lat) int64 20 30 40 50 60
      * plev_ind  (plev_ind) int64 71325 81325 91325 101325