python-3.x, time-series, statsmodels

Meaning of start/end params of statsmodels AutoReg.predict


I realize that this question has been asked before, but the existing solutions do not apply to the new statsmodels version (0.12).

I have this dataset in pandas; the DataFrame is named train:

date         value  
2017-01-09  0.331836
2017-01-10  0.330815
2017-01-11  0.329794
2017-01-12  0.328773
2017-01-13  0.327752
...     ...
2020-05-29  0.254081
2020-05-30  0.267420
2020-05-31  0.280758
2020-06-01  0.294097
2020-06-02  0.309384

The date column is the index and my last date is 2020-06-02. I would like forecasts for the 14 days after my last date, i.e. for the period from 2020-06-03 to 2020-06-16, both inclusive. I'm not sure whether I understand the start and end parameters correctly.

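from statsmodels.tsa.ar_model import AutoReg, ar_select_order

f = 14  # forecast horizon in days
y = train[y_col].to_numpy()  # y_col holds the name of the value column

# Select the lag order, then fit an AR model with a constant on those lags
mod = ar_select_order(y, maxlag=15)
AutoRegfit = AutoReg(y, trend='c', lags=mod.ar_lags).fit()
AutoRegfit.predict(start=len(train), end=len(train) + f - 1, dynamic=False)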

>>> array([0.32489822, 0.34010067, 0.35508626, 0.36968769, 0.38416325,
       0.39825263, 0.41186002, 0.42501389, 0.43766567, 0.44985079,
       0.46153405, 0.47270074, 0.48336156, 0.49351065])

That looks OK. However, does the first prediction (0.32489822) belong to 2020-06-03 or to 2020-06-02? I ask because in Python, when you specify a range, the first value is usually included and the last one is not.

In the docs it says:

the first forecast is start

Does this mean that the start parameter should be len(train)+1 rather than len(train)?


Solution

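  • No, setting start=len(train) is correct here. Note that indexing in Python starts at 0, so the last index available in your pandas Series is len(train)-1. Position len(train) is therefore the first out-of-sample period, i.e. 2020-06-03 in your data. Unlike a Python range, end is included in .predict(), so end=len(train)+f-1 returns exactly f forecasts.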

    An easy way to validate this is to compare the forecast from .predict() with the forecast computed by hand. Because I do not have access to your data, I will illustrate it with the sunspots example from the documentation. There, the following autoregressive model is estimated:

    import statsmodels.api as sm
    from statsmodels.tsa.ar_model import AutoReg
    data = sm.datasets.sunspots.load_pandas().data['SUNACTIVITY']
    res = AutoReg(data, lags=[1, 11, 12]).fit()
    

    Using .predict() to forecast next year's value now yields

    print(res.predict(start=len(data), end=len(data)))
    >>> 35.964103
    

    which is the same as the manually computed forecast

    print(sum(res.params * [1, *data.iloc[[-1, -11, -12]]]))
    >>> 35.964103
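
To connect the zero-based positions discussed above to the calendar dates in the question, here is a minimal sketch using only the standard library. It assumes the training data is daily with no gaps between 2017-01-09 and 2020-06-02 (the question does not confirm this), and `position_to_date` is a hypothetical helper introduced purely for illustration, not part of statsmodels.

```python
from datetime import date, timedelta

first_obs = date(2017, 1, 9)   # first date shown in the question's data
last_obs = date(2020, 6, 2)    # last date shown in the question's data
f = 14                         # forecast horizon in days

# Assumption: one observation per day with no gaps, so the number of
# observations equals the number of days in the sample.
n_obs = (last_obs - first_obs).days + 1


def position_to_date(pos):
    """Map a zero-based position to its calendar date (hypothetical helper)."""
    return first_obs + timedelta(days=pos)


# The values passed to predict in the question: both endpoints are included.
start = n_obs
end = n_obs + f - 1
forecast_dates = [position_to_date(p) for p in range(start, end + 1)]

print(position_to_date(n_obs - 1))   # last in-sample date: 2020-06-02
print(forecast_dates[0])             # first forecast date: 2020-06-03
print(forecast_dates[-1])            # last forecast date: 2020-06-16
print(len(forecast_dates))           # 14
```

Under that assumption, position len(train)-1 is 2020-06-02, so start=len(train) lines up with 2020-06-03 and end=len(train)+f-1 with 2020-06-16.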