I'm running aggregation functions on an xarray Dataset with a time coordinate, such as ds.max() and ds.min().
Results are returned, but since this is a weather dataset, it would also be useful to know the time index of each result, such as the date on which the maximum temperature occurred in a given month.
Can anybody advise how to achieve this? I can't find any information on it, and I would like to avoid having to search the dataset for the result.
I think you're looking for either idxmax or argmax and the like:
https://xarray.pydata.org/en/stable/generated/xarray.DataArray.argmax.html
https://xarray.pydata.org/en/stable/generated/xarray.DataArray.idxmax.html
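As a rough illustration of how the two differ: argmax returns integer positions along the dimension, while idxmax returns the coordinate labels at those positions. The array temps and its values below are made up for this sketch.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical data: three days of readings at two x locations.
temps = xr.DataArray(
    data=np.array([[1.0, 5.0], [7.0, 2.0], [3.0, 4.0]]),
    coords={"time": pd.date_range("2000-01-01", periods=3), "x": [0.5, 1.5]},
    dims=("time", "x"),
)

pos = temps.argmax("time")   # integer position of the maximum along time, per x
when = temps.idxmax("time")  # time label of the maximum, per x

# Indexing the raw time values with the positions recovers the same labels.
labels_from_pos = temps["time"].values[pos.values]
```

If you need the value as well as its date, the positions from argmax can be used to index the data, while idxmax is the more direct route to the date itself.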
Here's a 3D example:
import numpy as np
import pandas as pd
import xarray as xr

da = xr.DataArray(
    data=np.random.rand(4, 3, 2),
    coords={
        "time": pd.date_range("2000-01-01", "2000-01-04"),
        "y": [1, 2, 3],
        "x": [0.5, 1.5],
    },
    dims=("time", "y", "x"),
)
idxmax accepts only a single dimension; in this case, it gives the date of the maximum value for every (x, y).
da.idxmax("time")
<xarray.DataArray 'time' (y: 3, x: 2)>
array([['2000-01-01T00:00:00.000000000', '2000-01-04T00:00:00.000000000'],
['2000-01-03T00:00:00.000000000', '2000-01-03T00:00:00.000000000'],
['2000-01-02T00:00:00.000000000', '2000-01-02T00:00:00.000000000']],
dtype='datetime64[ns]')
Coordinates:
* y (y) int32 1 2 3
* x (x) float64 0.5 1.5
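For the "date of the maximum within each month" case from the question, one possible sketch is to group by calendar month and apply idxmax within each group. The name tmax and the sample values are hypothetical. Note that this groups by month number, so data spanning several years would lump all Januaries together and would need a different grouping.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical daily series straddling the January/February boundary.
tmax = xr.DataArray(
    data=np.array([1.0, 5.0, 7.0, 2.0]),
    coords={"time": pd.date_range("2000-01-30", periods=4)},
    dims=("time",),
    name="tmax",
)

# For each calendar month, find the date on which the maximum occurred.
date_of_monthly_max = tmax.groupby("time.month").map(lambda g: g.idxmax("time"))
```

The result is indexed by month rather than by time, with one date per month.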
Searching the dataset isn't really a problem, though: it's a cheap operation, provided you don't write the loops in (unvectorized) Python.
The following is fully general and works for basically every aggregate:
time_max = da["time"].where(da==da.max("time")).min("time")
Note that the final reduction (.min here) could be any reduction. It is needed because there's no guarantee that the maximum value is unique in your array. Using .min picks the first occurrence in time; if you want the last one:
time_max = da["time"].where(da==da.max("time")).max("time")
And so forth.
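A different way to get the same first/last distinction, shown here only as a sketch and not the approach used above, relies on idxmax resolving ties to the first occurrence along the dimension: reversing the time axis before the search then yields the last occurrence. The data with tied maxima is made up for the illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical data with tied maxima along time.
da = xr.DataArray(
    data=np.array([[1.0, 3.0], [3.0, 3.0], [3.0, 0.0]]),
    coords={"time": pd.date_range("2000-01-01", periods=3), "x": [0.5, 1.5]},
    dims=("time", "x"),
)

# Ties resolve to the earliest time.
first_max = da.idxmax("time")

# Reversing time before the search makes ties resolve to the latest time.
last_max = da.isel(time=slice(None, None, -1)).idxmax("time")
```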
This can be written so tersely because xarray automatically broadcasts da["time"] to a 3D array (with dims (time, y, x)), and the where method then sets every value where the condition is False to NaN or NaT. This obviously costs some memory, but it's unlikely that this is the most costly step of whatever analysis you're doing.
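To make the intermediate step concrete, here is a small sketch that stops before the final reduction and inspects the masked time array. The 2D data is hypothetical and deliberately contains tied maxima.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical data with tied maxima along time.
da = xr.DataArray(
    data=np.array([[1.0, 3.0], [3.0, 3.0], [3.0, 0.0]]),
    coords={"time": pd.date_range("2000-01-01", periods=3), "x": [0.5, 1.5]},
    dims=("time", "x"),
)

# Boolean mask marking every position that equals the maximum along time.
is_max = da == da.max("time")

# The 1D time coordinate is broadcast against the 2D mask; positions where
# the mask is False are filled with a missing value.
masked_time = da["time"].where(is_max)

# Number of positions that kept a time label (one per tied maximum).
n_kept = int(masked_time.notnull().sum())
```

Here masked_time has one entry per element of da, which is where the extra memory goes.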