python pandas indexing time-series statsmodels

Frequency in pandas timeseries index and statsmodels


I have a pandas timeseries y that does not work well with statsmodels functions.

import statsmodels.api as sm
from statsmodels.graphics.tsaplots import plot_acf

y.tail(10)

2019-09-20     7.854
2019-10-01    44.559
2019-10-10    46.910
2019-10-20    49.053
2019-11-01    24.881
2019-11-10    52.882
2019-11-20    84.779
2019-12-01    56.215
2019-12-10    23.347
2019-12-20    31.051
Name: mean_rainfall, dtype: float64

I verified that it is indeed a timeseries:

type(y)
pandas.core.series.Series

type(y.index)
pandas.core.indexes.datetimes.DatetimeIndex

From here, I can pass the timeseries through an autocorrelation function with no problem, and it produces the expected output:

plot_acf(y, lags=72, alpha=0.05)

However, when I try to pass this exact same object y to SARIMA:

mod = sm.tsa.statespace.SARIMAX(y.mean_rainfall, order=pdq, seasonal_order=seasonal_pdq)
results = mod.fit()

I get the following warning:

A date index has been provided, but it has no associated frequency information and so will be ignored when e.g. forecasting.

The problem is that the frequency of my timeseries is not regular (it is the 1st, 10th, and 20th of every month), so I cannot set freq='m' or freq='D', for example. What is the workaround in this case?

I am new to using timeseries, so any advice on how to keep my index from being ignored during forecasting would help. As it stands, this prevents me from making any predictions.
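
For reference, the missing frequency can be checked directly. The snippet below is a minimal sketch that rebuilds the last six rows shown above by hand (the construction of y here is an assumption, not the original loading code) and inspects the index's freq attribute and what pandas can infer from the dates.

```python
import pandas as pd

# Assumed reconstruction of the last six observations shown in the question.
dates = pd.to_datetime([
    "2019-11-01", "2019-11-10", "2019-11-20",
    "2019-12-01", "2019-12-10", "2019-12-20",
])
y = pd.Series(
    [24.881, 52.882, 84.779, 56.215, 23.347, 31.051],
    index=dates,
    name="mean_rainfall",
)

# An index built from a plain list of dates carries no frequency.
index_freq = y.index.freq

# pandas also cannot infer one, because the gaps are 9, 10 and 11 days.
inferred_freq = pd.infer_freq(y.index)

print(index_freq, inferred_freq)
```

Both values come back as None here, which is the condition the statsmodels message is reporting.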


Solution

  • First of all, it is important to understand how the datetime column relates to the target column (rainfall). Looking at the snippet you provide, I can think of two possibilities:

    1. y represents the rainfall that occurred in the date range between the current row's date and the next row's date. In that case, the timeseries is an aggregated rainfall series with unequal date buckets, i.e. 1-10, 10-20, 20-(end-of-month), and you have two options:
      • You can disaggregate your data, using either equal weighting or, better, an interpolation, to create a continuous and relatively smooth timeseries. You can then fit your model on the daily timeseries and generate predictions, which will also be daily. You can aggregate these back to the 1-10, 10-20, 20-(end-of-month) buckets to get your predictions. One way to do the resampling is using the code below.
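    import pandas as pd

    # ts is assumed to be a DataFrame with 'Date' and 'Rain' columns
    ts['Date'] = pd.to_datetime(ts['Date'], format='%d/%m/%y')
    # Days until the next observation, and change in rain up to it
    ts['delta_time'] = (ts['Date'].shift(-1) - ts['Date']).dt.days
    ts['delta_rain'] = ts['Rain'].shift(-1) - ts['Rain']
    # Index by a copy of the date so the 'Date' column survives resampling
    ts['timesteps'] = ts['Date']
    ts['grad_rain'] = ts['delta_rain'] / ts['delta_time']
    ts = ts.set_index('timesteps')
    # Upsample to daily rows, repeating each observation until the next one
    ts = ts.resample('D').ffill()
    ts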
    


    # Interpolate linearly between observations, then divide by the bucket length in days
    ts['daily_rain'] = ts['Rain'] + ts['grad_rain']*(ts.index - ts['Date']).dt.days
    ts['daily_rain'] = ts['daily_rain']/ts['delta_time']
    print(ts.head(50))
    


    daily_rain is now the target column, and the index (timesteps) is the timestamp.
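
    The equal-weighting variant and the step of aggregating back to the original buckets can be sketched as follows. This is a minimal illustration, not the code that produced the tables above: it assumes each value is a total for its bucket, uses only the three December 2019 rows from the question, and the helper bucket_start and the variable names are hypothetical.

```python
import pandas as pd

def bucket_start(day: pd.Timestamp) -> pd.Timestamp:
    """Map a day to the start of its bucket: the 1st, 10th or 20th of its month."""
    if day.day < 10:
        return day.replace(day=1)
    if day.day < 20:
        return day.replace(day=10)
    return day.replace(day=20)

# Assumption: each value is the total for the bucket that starts on its date.
bucket_totals = pd.Series(
    [56.215, 23.347, 31.051],
    index=pd.to_datetime(["2019-12-01", "2019-12-10", "2019-12-20"]),
    name="mean_rainfall",
)

# Daily index from the first bucket start to the end of the last month.
days = pd.date_range(
    bucket_totals.index[0],
    bucket_totals.index[-1] + pd.offsets.MonthEnd(0),
    freq="D",
)
starts = days.map(bucket_start)  # bucket start for every single day

# Equal weighting: repeat each bucket total daily, then divide by the bucket length.
repeated = bucket_totals.reindex(days, method="ffill")
days_in_bucket = repeated.groupby(starts).transform("count")
daily_rain = repeated / days_in_bucket

# A model would be fit on daily_rain here; its daily forecasts can be
# aggregated back to the 1st/10th/20th buckets in exactly the same way.
bucketed = daily_rain.groupby(starts).sum()
print(bucketed)
```

    Summing the daily values per bucket recovers the original totals, which is a quick sanity check that the disaggregation did not create or lose rainfall.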

    • The other option is to approximate each of the ranges 1-10, 10-20, 20-(EOM) as roughly 10 days, so that they count as equal timesteps. statsmodels won't accept the irregular dates as they are, so you would need to replace the index with mock dates and maintain a mapping back to your original dates. Below is what you would pass to statsmodels as y. The freq will be 'D' (daily), and you would need to rescale the seasonality so that it follows the new date scale.
    y.tail(10)
    
    2019-09-01    7.854
    2019-09-02    44.559
    2019-09-03    46.910
    2019-09-04    49.053
    2019-09-05    24.881
    2019-09-06    52.882
    2019-09-07    84.779
    2019-09-08    56.215
    2019-09-09    23.347
    2019-09-10    31.051
    Name: mean_rainfall, dtype: float64
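
    Building the mock index and the mapping could look like the sketch below. It is an illustration under assumptions: the series is rebuilt by hand from the ten rows in the question, the mock start date of 2019-09-01 simply matches the output above, and the names y_mock, mock_to_real and seasonal_period are hypothetical.

```python
import pandas as pd

# The ten irregular observation dates and values from the question.
real_dates = pd.to_datetime([
    "2019-09-20", "2019-10-01", "2019-10-10", "2019-10-20", "2019-11-01",
    "2019-11-10", "2019-11-20", "2019-12-01", "2019-12-10", "2019-12-20",
])
y = pd.Series(
    [7.854, 44.559, 46.910, 49.053, 24.881,
     52.882, 84.779, 56.215, 23.347, 31.051],
    index=real_dates,
    name="mean_rainfall",
)

# Mock index: one "day" per observation, so the index carries freq='D'.
mock_index = pd.date_range(start="2019-09-01", periods=len(y), freq="D")
y_mock = pd.Series(y.to_numpy(), index=mock_index, name=y.name)

# Mapping from each mock date back to the real observation date.
mock_to_real = pd.Series(real_dates.to_numpy(), index=mock_index)

# Rescaled seasonality: three observations per month means an annual
# cycle spans 3 * 12 = 36 mock "days" rather than 365.
seasonal_period = 3 * 12

print(y_mock.index.freq)
print(mock_to_real.loc[pd.Timestamp("2019-09-10")])
```

    y_mock is what would go into the model, and mock_to_real translates the dates back afterwards. Forecast steps beyond the last observation need their real dates generated separately by continuing the 1st, 10th, 20th pattern.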
    

    I would recommend the first option, though, as it is more accurate. It also lets you try out other aggregation levels, both during model training and for your predictions, which gives you more control.

    2. The second scenario is that the data represents measurements only for the date itself and not for the range. That would mean that, technically, you do not have enough information to construct an accurate timeseries: your timesteps are not equidistant, and you do not know what happened between them. However, you can still improvise and get some approximations going. The second approach listed above would still work as is. The first approach would require interpolating between the measurements, and because the target variable is rainfall, which varies a lot from day to day, I would strongly discourage this.