Search code examples
pythonpandasdatetimedata-sciencepython-xarray

Ways to resample non-standard CFTimeIndex calendars (360-day, no-leap-year) with xarray for pandas usage


#60198708 brought me to open this question as I've not found the nice solution yet.

The issue

I have downloaded several climate models from the EURO-CORDEX ensemble for daily precipitaion flux. While some models work with standard calendar, compatible with Pandas datetime, others, particularly MOHC HadGem2 ES, use 360-day CFTimeIndex.

The principal question is, how to effectively resample to monthly data with these calendars to be able to harmonize it and produce ensemble statistics later.

The data for precipitation flux (2011-2015 excerpt) can look like this You can download it here.

<xarray.Dataset>
Dimensions:       (bnds: 2, rlat: 412, rlon: 424, time: 1800)
Coordinates:
    lat           (rlat, rlon) float64 ...
    lon           (rlat, rlon) float64 ...
  * rlat          (rlat) float64 -23.38 -23.26 -23.16 ... 21.61 21.73 21.83
  * rlon          (rlon) float64 -28.38 -28.26 -28.16 ... 17.93 18.05 18.16
  * time          (time) object 2011-01-01 12:00:00 ... 2015-12-30 12:00:00
Dimensions without coordinates: bnds
Data variables:
    pr            (time, rlat, rlon) float32 ...
    rotated_pole  |S1 ...
    time_bnds     (time, bnds) object ...
Attributes:
    CDI:                            Climate Data Interface version 1.3.2
    Conventions:                    CF-1.6
    NCO:                            4.4.2
    CDO:                            Climate Data Operators version 1.3.2 (htt...
    contact:                        Fredrik Boberg, Danish Meteorological Ins...
    creation_date:                  2019-11-16 14:39:25
    experiment:                     Scenario experiment using HadGEM as drivi...
    experiment_id:                  rcp45
    driving_experiment:             MOHC-HadGEM2-ES,rcp45,r1i1p1
    driving_model_id:               MOHC-HadGEM2-ES
    driving_model_ensemble_member:  r1i1p1
    driving_experiment_name:        rcp45
    frequency:                      day
    institution:                    Danish Meteorological Institute
    institute_id:                   DMI
    model_id:                       DMI-HIRHAM5
    rcm_version_id:                 v2
    project_id:                     CORDEX
    CORDEX_domain:                  EUR-11
    product:                        output
    tracking_id:                    hdl:21.14103/158e462e-499c-4d6e-8462-ac3e...
    c3s_disclaimer:                 This data has been produced in the contex...

As you can see, dataset's time dimension is cftime.Datetime360Day. All months are 30-days which is sometimes good for climate projections, not for pandas though.

<xarray.DataArray 'time' (time: 1800)>
array([cftime.Datetime360Day(2011-01-01 12:00:00),
       cftime.Datetime360Day(2011-01-02 12:00:00),
       cftime.Datetime360Day(2011-01-03 12:00:00), ...,
       cftime.Datetime360Day(2015-12-28 12:00:00),
       cftime.Datetime360Day(2015-12-29 12:00:00),
       cftime.Datetime360Day(2015-12-30 12:00:00)], dtype=object)
Coordinates:
  * time     (time) object 2011-01-01 12:00:00 ... 2015-12-30 12:00:00
Attributes:
    standard_name:  time
    long_name:      time
    bounds:         time_bnds

What I tried so far

I went the dirty way by converting CFTimeIndex to string, putting to the pandas.DataFrame and converting the time with pd.to_datetime and errors=coerce

ds = xarray.open_dataset('data/mohc_hadgem2_es.nc')

def cft_to_string(cfttime_obj):
        month = str(cfttime_obj.month)
        day = str(cfttime_obj.day)

        # This is awful but there were no two-digit months/days by default
        month = '0'+month if len(month)==1 else month
        day = '0'+day if len(day)==1 else day

        return f'{cfttime_obj.year}-{month}-{day}'

# Apply above function
ds_time_strings = list(map(cft_to_string, ds['time']))

# Get precipitation values only (to use in pandas dataframe)
# Suppose the data are from multiple pixels (for whole of Europe)
# - that's why the mean(axis=(1,2))

precipitation = ds['pr'].values.mean(axis=(1,2))

# To dataframe
df = pd.DataFrame(index=ds_time_strings, data={'precipitation': precipitation})

# Coerce erroneous dates
df.index = pd.to_datetime(df.index, errors='coerce') # Now, dates such as 2011-02-30 are omitted

This gives a dataframe with non-standard dates as NaT and some dates (31st days) are missing. I don't mind since I create projections for 90 years span.

            precipitation
2011-01-01  0.000049
2011-01-02  0.000042
2011-01-03  0.000031
2011-01-04  0.000030
2011-01-05  0.000038
... ...
2011-02-28  0.000041
NaT         0.000055
NaT         0.000046
2011-03-01  0.000031
... ...
2015-12-26  0.000028
2015-12-27  0.000034
2015-12-28  0.000028
2015-12-29  0.000025
2015-12-30  0.000024
1800 rows × 1 columns

Now I can resample to monthly data using pandas easily.

While this seems to work, is there a cleaner way with xarray/pandas only? Possibly non-string based?

  • ds.indexes['time'].to_datetimeindex() fails on a non-standard calendar
  • ds.resample(time='M') would do the resampling, however, it yields non-standard month ends. I did not find the way to floor to correct month ends since ds['time'].dt.floor('M') fails on ValueError: <MonthEnd: n=1> is a non-fixed frequency
  • xarray.groupby(time='time.month') can handle non-standard calendars, however, its use case is to group along different axes, which is undesired

I certainly must have missed something as this is a complex issue. Any help appreciated.


Solution

  • Thanks for the detailed example! If a time series of monthly means is acceptable for your analysis, I think the cleanest approach would be to resample to "month-start" frequency and then harmonize the date types, e.g. for the datasets indexed by a CFTimeIndex, something like:

    resampled = ds.resample(time="MS").mean()
    resampled["time"] = resampled.indexes["time"].to_datetimeindex()
    

    This is basically your second bullet point, but with a minor change. Resampling to month-start frequency gets around the issue that a 360-day calendar contains month ends that do not exist in a standard calendar, e.g. February 30th.