datetime python-xarray assign resampling

Python, adding a Water-Year time variable in an X-array


I have the following xarray Dataset named 'scratch', with the lat, long and lev coords eliminated and only the time coord left as a dimension. It has several variables, so it is now a multivariate daily time series from 2002 to 2014. I need to add a new variable, "water_year", that shows which water year each day belongs to. It could be done by adding another column to the variables with Xarray.assign, or with Xarray.resample, but I am not sure and could use some help. Note: a "water year" starts on Oct 01 and ends on Sep 30 of the next year, so water year 2003 runs from 10-01-2002 to 09-30-2003.

See my Xarray here


Solution

  • I'll create a sample dataset with a single variable for this example:

    In [1]: import numpy as np
       ...: import pandas as pd
       ...: import xarray as xr

    In [2]: scratch = xr.Dataset(
       ...:     {'Baseflow': (('time', ), np.random.random(4018))},
       ...:     coords={'time': pd.date_range('2002-10-01', freq='D', periods=4018)},
       ...: )
    
    In [3]: scratch
    Out[3]:
    <xarray.Dataset>
    Dimensions:   (time: 4018)
    Coordinates:
      * time      (time) datetime64[ns] 2002-10-01 2002-10-02 ... 2013-09-30
    Data variables:
        Baseflow  (time) float64 0.7588 0.05129 0.9914 ... 0.7744 0.6581 0.8686
    

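    We can build a water_year array using the datetime components accessor .dt. The comparison month >= 10 is True (which counts as 1 in arithmetic) for October through December, so adding it to the calendar year moves those months into the following water year: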

    In [4]: water_year = (scratch.time.dt.month >= 10) + scratch.time.dt.year
       ...: water_year
    Out[4]:
    <xarray.DataArray (time: 4018)>
    array([2003, 2003, 2003, ..., 2013, 2013, 2013])
    Coordinates:
      * time     (time) datetime64[ns] 2002-10-01 2002-10-02 ... 2013-09-30
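If the boolean arithmetic feels too terse, the same rule can be spelled out with xr.where. This is only a sketch: water_year_of and its start_month parameter are names I made up for illustration, not part of xarray, and the six timestamps are chosen to sit on either side of the Oct 01 boundary.

```python
import pandas as pd
import xarray as xr


# Hypothetical helper (not an xarray function): map each timestamp to its water year.
def water_year_of(time, start_month=10):
    """Return the water year for each timestamp in a datetime DataArray."""
    # Months from start_month onward count toward the next calendar year's water year.
    return xr.where(time.dt.month >= start_month, time.dt.year + 1, time.dt.year)


# A few dates straddling the water-year boundary.
time = xr.DataArray(
    pd.to_datetime(
        ["2002-09-30", "2002-10-01", "2002-12-31",
         "2003-01-01", "2003-09-30", "2003-10-01"]
    ),
    dims="time",
)

wy = water_year_of(time)
print(wy.values)  # [2002 2003 2003 2003 2003 2004]
```

Both forms should give the same integers; the explicit version just makes the start month easy to change if your water year begins in a different month.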
    

    Because water_year is a DataArray indexed by an existing dimension, we can just add it as a coordinate and xarray will understand that it's a non-dimension coordinate. This is important to make sure we don't create a new dimension in our data.

    In [7]: scratch.coords['water_year'] = water_year
    
    In [8]: scratch
    Out[8]:
    <xarray.Dataset>
    Dimensions:     (time: 4018)
    Coordinates:
      * time        (time) datetime64[ns] 2002-10-01 2002-10-02 ... 2013-09-30
        water_year  (time) int64 2003 2003 2003 2003 2003 ... 2013 2013 2013 2013
    Data variables:
        Baseflow    (time) float64 0.7588 0.05129 0.9914 ... 0.7744 0.6581 0.8686
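Since the question mentions assign, here is a small sketch of how the method-style alternatives differ, as far as I understand them: assign_coords returns a new Dataset with water_year as a coordinate, while assign adds it as a data variable. The dataset ds below is a shortened stand-in (730 days) for the sample data, not the asker's actual data.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Shortened stand-in for the sample dataset: two water years of daily data.
ds = xr.Dataset(
    {"Baseflow": (("time",), np.random.random(730))},
    coords={"time": pd.date_range("2002-10-01", freq="D", periods=730)},
)
water_year = ds.time.dt.year + (ds.time.dt.month >= 10)

# assign_coords returns a new Dataset; ds itself is left unchanged.
as_coord = ds.assign_coords(water_year=water_year)

# assign adds water_year as a data variable instead of a coordinate.
as_var = ds.assign(water_year=water_year)

print("water_year" in as_coord.coords)   # True
print("water_year" in as_var.data_vars)  # True
print("water_year" in ds.variables)      # False
print(dict(as_coord.sizes))              # {'time': 730}
```

Either way no new dimension appears. The coordinate form is the one that keeps water_year out of arithmetic and aggregations over the data variables.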
    

    Because water_year is indexed by time, we still need to select from the arrays using the time dimension, but we can subset the arrays to specific water years:

    In [9]: scratch.sel(time=(scratch.water_year == 2010))
    Out[9]:
    <xarray.Dataset>
    Dimensions:     (time: 365)
    Coordinates:
      * time        (time) datetime64[ns] 2009-10-01 2009-10-02 ... 2010-09-30
        water_year  (time) int64 2010 2010 2010 2010 2010 ... 2010 2010 2010 2010
    Data variables:
        Baseflow    (time) float64 0.441 0.7586 0.01377 ... 0.2656 0.1054 0.6964
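Two other ways to pull out a single water year, sketched on a rebuilt copy of the sample dataset (called ds here): masking with where(..., drop=True), and slicing the time index by the water year's start and end dates. Both should return the same 365 days for water year 2010.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Rebuild the sample dataset and attach the water_year coordinate.
ds = xr.Dataset(
    {"Baseflow": (("time",), np.random.random(4018))},
    coords={"time": pd.date_range("2002-10-01", freq="D", periods=4018)},
)
ds = ds.assign_coords(water_year=ds.time.dt.year + (ds.time.dt.month >= 10))

# Option 1: mask on the coordinate and drop the times that do not match.
by_mask = ds.where(ds.water_year == 2010, drop=True)

# Option 2: slice the time index using the water year's date range.
by_slice = ds.sel(time=slice("2009-10-01", "2010-09-30"))

print(by_mask.sizes["time"], by_slice.sizes["time"])  # 365 365
```

The slice is handy for one-off selections; the coordinate-based mask is easier to reuse when looping over many water years.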
    

    groupby can group on non-dimension coordinates directly, so the following aggregation works:

    In [10]: scratch.groupby('water_year').sum()
    Out[10]:
    <xarray.Dataset>
    Dimensions:     (water_year: 11)
    Coordinates:
      * water_year  (water_year) int64 2003 2004 2005 2006 ... 2010 2011 2012 2013
    Data variables:
        Baseflow    (water_year) float64 187.6 186.4 184.7 ... 185.2 189.6 192.7
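A quick sanity check on the grouping is to count the days in each water year; leap days should show up as 366 in the water years that contain a Feb 29. This sketch rebuilds the sample dataset as ds and also computes a per-water-year mean the same way.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Rebuild the sample dataset and attach the water_year coordinate.
ds = xr.Dataset(
    {"Baseflow": (("time",), np.random.random(4018))},
    coords={"time": pd.date_range("2002-10-01", freq="D", periods=4018)},
)
ds = ds.assign_coords(water_year=ds.time.dt.year + (ds.time.dt.month >= 10))

# Number of daily values in each water year.
days = ds.Baseflow.groupby("water_year").count()

# Any other reduction works the same way, e.g. the mean per water year.
annual_mean = ds.Baseflow.groupby("water_year").mean()

print(days.values)  # [365 366 365 365 365 366 365 365 365 366 365]
```

If a count comes out well short of 365, that water year is only partially covered and its sum or mean probably should not be compared with the others.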