Finding time index for result of Xarray aggregation


I'm running aggregation functions on an xarray Dataset with a time coordinate, such as:

ds.max(), ds.min()

Results are returned, but as this is a weather dataset it would also be useful to know the time index of each result, such as the date on which the maximum temperature occurred in a given month.

Can anybody advise how to achieve this? I can't find any info anywhere.

I would like to avoid having to search the dataset for the result.


Solution

  • I think you're looking for either idxmax or argmax and the like:

    https://xarray.pydata.org/en/stable/generated/xarray.DataArray.argmax.html
    https://xarray.pydata.org/en/stable/generated/xarray.DataArray.idxmax.html

    Here's a 3D example:

    import numpy as np
    import pandas as pd
    import xarray as xr

    # Random values on a (time, y, x) grid with four daily time steps
    da = xr.DataArray(
        data=np.random.rand(4, 3, 2),
        coords={
            "time": pd.date_range("2000-01-01", "2000-01-04"),
            "y": [1, 2, 3],
            "x": [0.5, 1.5],
        },
        dims=("time", "y", "x"),
    )
    

    idxmax accepts only a single dimension; in this case, it gives the date of the maximum value for every (x, y).

    da.idxmax("time")
    
    <xarray.DataArray 'time' (y: 3, x: 2)>
    array([['2000-01-01T00:00:00.000000000', '2000-01-04T00:00:00.000000000'],
           ['2000-01-03T00:00:00.000000000', '2000-01-03T00:00:00.000000000'],
           ['2000-01-02T00:00:00.000000000', '2000-01-02T00:00:00.000000000']],
          dtype='datetime64[ns]')
    Coordinates:
      * y        (y) int32 1 2 3
      * x        (x) float64 0.5 1.5
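
For the per-month case raised in the question, one possible approach is to combine idxmax with groupby. The following is only a sketch under assumptions: the variable name `temp` and the data values are made up for illustration, and grouping by "time.month" pools the same calendar month across years, so multi-year data would need a different grouping.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical daily temperatures for January and February 2000
time = pd.date_range("2000-01-01", "2000-02-29", freq="D")
values = np.zeros(time.size)
values[9] = 30.0   # warmest day in January: 2000-01-10
values[40] = 25.0  # warmest day in February: 2000-02-10
temp = xr.DataArray(values, coords={"time": time}, dims="time", name="temp")

# Maximum value per calendar month
monthly_max = temp.groupby("time.month").max("time")

# Date on which that maximum occurred, per calendar month
monthly_max_date = temp.groupby("time.month").map(lambda g: g.idxmax("time"))
```

Both results are indexed by month, so `monthly_max_date.sel(month=1)` gives the date belonging to `monthly_max.sel(month=1)`.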
    

    Searching the dataset isn't really a problem though: it's a cheap operation, provided you don't write the loops in (unvectorized) Python.

    The following is fully general and works for basically every aggregate:

    time_max = da["time"].where(da==da.max("time")).min("time")
    

    Note that the final reduction (.min here) could be any reduction; it is needed because there's no guarantee that there aren't duplicate maximum values in your array. Using .min picks the first one in time; if you want the last one, use .max:

    time_max = da["time"].where(da==da.max("time")).max("time")
    

    And so forth.

    This can be written so tersely because xarray automatically broadcasts da["time"] to a 3D array (with dims (time, y, x)), and the where method then replaces every value for which the condition is False with NaN (or NaT for datetimes). This obviously costs some memory, but it's unlikely to be the most costly step of whatever analysis you're doing.
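
To make the masking step concrete, here is a small sketch on a 1D series where the maximum occurs twice. The data and the names `is_max`, `candidates` and `max_dates` are made up for illustration; the example stops before the final reduction and just lists the surviving dates.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical series in which the maximum (5.0) occurs on two dates
time = pd.date_range("2000-01-01", periods=4, freq="D")
da = xr.DataArray([1.0, 5.0, 5.0, 2.0], coords={"time": time}, dims="time")

# Boolean mask marking every position that equals the maximum
is_max = da == da.max("time")

# Keep the time label where the mask is True; other positions become NaT
candidates = da["time"].where(is_max)

# Drop the NaT entries to list every date on which the maximum occurred
max_dates = candidates.dropna("time")
```

Here `max_dates` holds both 2000-01-02 and 2000-01-03, which is why a final reduction is needed to settle on a single date.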