Tags: python, machine-learning, statistics, statsmodels, arima

Statsmodels: use a trained ARIMA model to make a manual point prediction by explicitly supplying the endog values to use


I am using the statsmodels library to build an ARIMAX model for time series forecasting. I have a rather odd question: how do I force a trained model to perform a fully manual point forecast by explicitly providing it with the endog and exog values to use for forecasting?

To give you an idea, I train my model on annual data for the years 2000-2017, where I forecast the future workforce of a company based on the previous years' workforce and a bunch of exog variables. It works well. The catch is that in 2018 and 2019 the company greatly expanded its number of workers as a one-off business decision, and we also know that our model trained on 2000-2017 is "correct" from a business perspective.

What I want to do is use the model I trained on 2000-2017 and produce a forecast for 2020 while explicitly providing it with the actual values for 2018 and 2019. This way we ensure that the model doesn't try to fit this one-off jump, which would decrease its quality. But how do I do it? Please note that I use a model with AR(2), so I need to supply the 2 previous years of data.

I am looking for a method in statsmodels that would allow me to:

1) Pick a trained ARIMAX model

2) Explicitly give it the exog values for the 2 previous years

3) Explicitly give it the endog values for the 2 previous years

4) Deliver just a single point forecast

Both the predict and forecast methods only let you specify the number of steps for which to produce an out-of-sample forecast; they don't let you explicitly supply new endog values to use for forecasting.


Solution

  • In the currently released version (v0.10), you would want to do something like the following (note that for this to work you must use the sm.tsa.SARIMAX model rather than e.g. the sm.tsa.ARIMA model):

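    import statsmodels.api as sm
    
    # endog, exog: the full 2000-2019 data; exog_fcast: the exog values for 2020
    # Estimate the parameters on 2000-2017 only
    training_endog = endog.loc[:'2017']
    training_exog = exog.loc[:'2017']
    
    training_mod = sm.tsa.SARIMAX(training_endog, order=(2, 0, 0), exog=training_exog)
    training_res = training_mod.fit()
    
    # Apply the already-fitted parameters to the full dataset without re-estimating them
    mod = sm.tsa.SARIMAX(endog, order=(2, 0, 0), exog=exog)
    res = mod.smooth(training_res.params)
    print(res.forecast(1, exog=exog_fcast))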
    

    NB: we have recently added a new feature that makes this kind of thing easier. It is available in the GitHub master repository and will be released in v0.11 (no timeline for this release yet, though). With it, you could instead do:

    # Estimate the parameters on 2000-2017 only
    training_endog = endog.loc[:'2017']
    training_exog = exog.loc[:'2017']
    
    training_mod = sm.tsa.SARIMAX(training_endog, order=(2, 0, 0), exog=training_exog)
    training_res = training_mod.fit()
    
    # Extend the results with the 2018-2019 observations, keeping the fitted parameters
    res = training_res.append(endog.loc['2018':], exog=exog.loc['2018':])
    print(res.forecast(1, exog=exog_fcast))
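
For intuition about what the single point forecast works out to, here is a minimal sketch in plain Python. It assumes the model above, i.e. `order=(2, 0, 0)` with no trend term, and relies on SARIMAX treating `exog` as a regression whose errors follow the AR(2) process. The function name `manual_ar2_forecast` and all numbers in the example are made up for illustration; in practice the coefficients would be read from `training_res.params`. This is not a statsmodels API, only a hand calculation that should agree with `res.forecast(1, exog=exog_fcast)` once the two most recent observations are supplied.

```python
def manual_ar2_forecast(beta, phi1, phi2, y_prev, x_prev, x_next):
    """One-step point forecast for a regression with AR(2) errors.

    beta   : exog coefficients (hypothetical values in the example below)
    phi1   : AR coefficient on the most recent error
    phi2   : AR coefficient on the error before that
    y_prev : (y[T-1], y[T]), the last two observed endog values, oldest first
    x_prev : (x[T-1], x[T]), the matching exog rows
    x_next : the exog row for the forecast period T+1
    """
    def regression_part(row):
        # Contribution of the exog variables: sum of coefficient * value
        return sum(b * v for b, v in zip(beta, row))

    # Regression errors for the last two observed periods
    u_older = y_prev[0] - regression_part(x_prev[0])
    u_latest = y_prev[1] - regression_part(x_prev[1])

    # Forecast = regression part for T+1 plus the AR(2) forecast of the error
    return regression_part(x_next) + phi1 * u_latest + phi2 * u_older


# Illustrative numbers only: one exog variable, actuals for 2018 and 2019
forecast_2020 = manual_ar2_forecast(
    beta=[2.0], phi1=0.5, phi2=0.2,
    y_prev=(130.0, 150.0),
    x_prev=([60.0], [70.0]),
    x_next=[72.0],
)
print(forecast_2020)  # 151.0
```

The calculation makes explicit why only the two most recent years matter for an AR(2) specification: the forecast depends on the 2018 and 2019 values through their regression errors, while the coefficients stay fixed at what was estimated on 2000-2017.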