
Can you extract data based on a date range for multiple years in an nc file?


I have an nc (NetCDF) file containing temperature data. I want to extract the temperature for the date range May 30th to August 18th for each of the years 2001 to 2018. The dates are stored in the format 2001-01-23. A solution in either Python or cdo is fine. My data overall looks like this:

<xarray.Dataset>
Dimensions:  (crs: 1, lat: 9, lon: 35, time: 6574)
Coordinates:
  * lat      (lat) float64 50.0 52.5 55.0 57.5 60.0 62.5 65.0 67.5 70.0
  * lon      (lon) float64 177.5 180.0 182.5 185.0 ... 255.0 257.5 260.0 262.5
  * crs      (crs) uint16 3
Dimensions without coordinates: time
Data variables:
    days     (time) datetime64[ns] 2001-01-01 2001-01-02 ... 2018-12-31
    tmax     (time, lat, lon) float32 ...

How can I extract the date range mentioned above for every year?


Solution

  • In cases like this, where a simple range will not suffice, I typically find the best approach is to construct a boolean array with the same length as the time dimension that is True for each date I'd like to include in the selection and False otherwise. I can then pass this boolean array as an indexer in sel to get the selection I'd like.

    For this example I would make use of the dayofyear, year, and is_leap_year attributes of the datetime accessor in xarray:

    import pandas as pd
    
    # Note dayofyear is the ordinal day of the year (January 1st is 1),
    # so for any date after February 28th it is one larger in leap years
    # than in non-leap years.
    may_30_leap = pd.Timestamp("2000-05-30").dayofyear
    august_18_leap = pd.Timestamp("2000-08-18").dayofyear
    range_leap = range(may_30_leap, august_18_leap + 1)
    
    may_30_noleap = pd.Timestamp("2001-05-30").dayofyear
    august_18_noleap = pd.Timestamp("2001-08-18").dayofyear
    range_noleap = range(may_30_noleap, august_18_noleap + 1)
    
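    # range excludes its end point, so this covers 2001 through 2018.
    year_range = range(2001, 2019)
    
    # ds is the Dataset shown in the question. Match each date against the
    # day-of-year window for its own calendar type (leap or non-leap).
    indexer = ((ds.days.dt.dayofyear.isin(range_leap) & ds.days.dt.is_leap_year) |
               (ds.days.dt.dayofyear.isin(range_noleap) & ~ds.days.dt.is_leap_year))
    # Restrict the selection to the requested years.
    indexer = indexer & ds.days.dt.year.isin(year_range)
    
    result = ds.sel(time=indexer)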
    

    The leap year logic is a bit clunky, but I can't think of a cleaner way.
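
One possible way to sidestep the leap-year branches, offered here as a sketch rather than as part of the original answer, is to compare month and day directly instead of dayofyear. The example below builds a small synthetic dataset that only mimics the layout in the question (a time dimension without coordinates and a days data variable); the grid size, the zero-filled tmax values, and the names month_day, in_window and in_years are assumptions made for illustration. It uses isel with a plain boolean array, since time carries no coordinate labels here.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for the dataset in the question (assumed shape and values).
days = pd.date_range("2001-01-01", "2018-12-31", freq="D")
ds = xr.Dataset(
    {
        "days": ("time", days),
        "tmax": (
            ("time", "lat", "lon"),
            np.zeros((days.size, 2, 3), dtype="float32"),
        ),
    },
    coords={"lat": [50.0, 52.5], "lon": [177.5, 180.0, 182.5]},
)

# Encode each date as month * 100 + day, so May 30 becomes 530 and
# August 18 becomes 818. This key does not depend on leap years.
month_day = ds.days.dt.month * 100 + ds.days.dt.day
in_window = (month_day >= 530) & (month_day <= 818)
in_years = (ds.days.dt.year >= 2001) & (ds.days.dt.year <= 2018)

# Positional selection along time with a boolean mask.
result = ds.isel(time=(in_window & in_years).values)
```

Each year contributes 81 days (2 in May, 30 in June, 31 in July, 18 in August), so the length of result along time can be checked against 18 * 81 as a quick sanity test.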