I have the following xarray Dataset named 'scratch', with the lat, lon, and lev coordinates eliminated and only the time coordinate left as a dimension. It has several variables, so it is now a multivariate daily time series from 2002 to 2014. I need to add a new variable, "water_year", that gives the water year each day belongs to. This could be done by adding another variable with Dataset.assign, or perhaps with Dataset.resample, but I am not sure and could use some help. Note: a "water year" starts on Oct 01 and ends on Sep 30 of the next year, so water year 2003 runs from 10-01-2002 to 09-30-2003.
I'll create a sample dataset with a single variable for this example:
In [1]: import numpy as np
...: import pandas as pd
...: import xarray as xr
In [2]: scratch = xr.Dataset(
...: {'Baseflow': (('time', ), np.random.random(4018))},
...: coords={'time': pd.date_range('2002-10-01', freq='D', periods=4018)},
...: )
In [3]: scratch
Out[3]:
<xarray.Dataset>
Dimensions: (time: 4018)
Coordinates:
* time (time) datetime64[ns] 2002-10-01 2002-10-02 ... 2013-09-30
Data variables:
Baseflow (time) float64 0.7588 0.05129 0.9914 ... 0.7744 0.6581 0.8686
We can build a water_year array using the Datetime Components accessor .dt:
In [4]: water_year = (scratch.time.dt.month >= 10) + scratch.time.dt.year
...: water_year
Out[4]:
<xarray.DataArray (time: 4018)>
array([2003, 2003, 2003, ..., 2013, 2013, 2013])
Coordinates:
* time (time) datetime64[ns] 2002-10-01 2002-10-02 ... 2013-09-30
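The expression works because the comparison yields True (counted as 1) for October through December, which bumps those days into the next calendar year. As a rough check of the boundary dates, here is a minimal sketch in plain Python; the helper name water_year_of is hypothetical and only used for illustration:

```python
from datetime import date


def water_year_of(day: date) -> int:
    """Return the water year (Oct 1 to Sep 30) that contains `day`."""
    # The comparison is a bool; adding it to the year adds 1 for Oct-Dec.
    return day.year + (day.month >= 10)


last_day_wy2002 = water_year_of(date(2002, 9, 30))
first_day_wy2003 = water_year_of(date(2002, 10, 1))
last_day_wy2003 = water_year_of(date(2003, 9, 30))

print(last_day_wy2002, first_day_wy2003, last_day_wy2003)  # 2002 2003 2003
```

The xarray version above applies the same arithmetic to every timestamp at once.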
Because water_year is a DataArray indexed by an existing dimension, we can just add it as a coordinate and xarray will understand that it's a non-dimension coordinate. This is important to make sure we don't create a new dimension in our data.
In [7]: scratch.coords['water_year'] = water_year
In [8]: scratch
Out[8]:
<xarray.Dataset>
Dimensions: (time: 4018)
Coordinates:
* time (time) datetime64[ns] 2002-10-01 2002-10-02 ... 2013-09-30
water_year (time) int64 2003 2003 2003 2003 2003 ... 2013 2013 2013 2013
Data variables:
Baseflow (time) float64 0.7588 0.05129 0.9914 ... 0.7744 0.6581 0.8686
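Since the question mentions assign, here is a sketch of two non-mutating alternatives: assign_coords attaches water_year as a non-dimension coordinate (like the assignment above), while assign stores it as an ordinary data variable. The names as_coord and as_var are made up for this example, and the seeded random data is only a stand-in for the real dataset:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Stand-in dataset with the same shape as the example above (random values).
rng = np.random.default_rng(0)
time = pd.date_range("2002-10-01", freq="D", periods=4018)
scratch = xr.Dataset({"Baseflow": (("time",), rng.random(4018))}, coords={"time": time})

water_year = (scratch.time.dt.month >= 10) + scratch.time.dt.year

# Both calls return a new Dataset and leave `scratch` untouched.
as_coord = scratch.assign_coords(water_year=water_year)  # non-dimension coordinate
as_var = scratch.assign(water_year=water_year)           # regular data variable

print(as_coord)
print(as_var)
```

The coordinate form is usually the more convenient one for selection and grouping, because it travels with every variable you pull out of the Dataset.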
Because water_year is indexed by time, we still need to select from the arrays using the time dimension, but we can subset the arrays to specific water years:
In [9]: scratch.sel(time=(scratch.water_year == 2010))
Out[9]:
<xarray.Dataset>
Dimensions: (time: 365)
Coordinates:
* time (time) datetime64[ns] 2009-10-01 2009-10-02 ... 2010-09-30
water_year (time) int64 2010 2010 2010 2010 2010 ... 2010 2010 2010 2010
Data variables:
Baseflow (time) float64 0.441 0.7586 0.01377 ... 0.2656 0.1054 0.6964
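For a single water year, a label-based time slice should give the same subset without needing the mask. This is a sketch on a stand-in dataset (seeded random values), and wy2010 is just an illustrative name:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Stand-in dataset with the same shape as the example above (random values).
rng = np.random.default_rng(0)
time = pd.date_range("2002-10-01", freq="D", periods=4018)
scratch = xr.Dataset({"Baseflow": (("time",), rng.random(4018))}, coords={"time": time})
scratch.coords["water_year"] = (scratch.time.dt.month >= 10) + scratch.time.dt.year

# Label-based slices on a datetime index include both endpoints.
wy2010 = scratch.sel(time=slice("2009-10-01", "2010-09-30"))

print(wy2010.sizes["time"])
print(bool((wy2010.water_year == 2010).all()))
```

The mask approach is more general, since it also handles selecting several non-adjacent water years in one step.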
Groupby operations can use non-dimension coordinates directly, so the following works:
In [10]: scratch.groupby('water_year').sum()
Out[10]:
<xarray.Dataset>
Dimensions: (water_year: 11)
Coordinates:
* water_year (water_year) int64 2003 2004 2005 2006 ... 2010 2011 2012 2013
Data variables:
Baseflow (water_year) float64 187.6 186.4 184.7 ... 185.2 189.6 192.7
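The same grouping works for other reductions. As a sanity check, counting the days in each group should give 365 or 366 for every complete water year. A sketch on a stand-in dataset (seeded random values; days_per_year and mean_per_year are illustrative names):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Stand-in dataset with the same shape as the example above (random values).
rng = np.random.default_rng(0)
time = pd.date_range("2002-10-01", freq="D", periods=4018)
scratch = xr.Dataset({"Baseflow": (("time",), rng.random(4018))}, coords={"time": time})
scratch.coords["water_year"] = (scratch.time.dt.month >= 10) + scratch.time.dt.year

# Number of daily values in each water year (366 when it contains Feb 29).
days_per_year = scratch["Baseflow"].groupby("water_year").count()

# Mean daily value per water year.
mean_per_year = scratch["Baseflow"].groupby("water_year").mean()

print(days_per_year.values)
print(mean_per_year)
```

A count well below 365 would flag a partial water year at the start or end of the record, which you may want to drop before comparing annual totals.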