python statsmodels arima confidence-interval anomaly-detection

Manually creating confidence intervals for the mean response using statsmodels SARIMAX

I have created a timeseries SARIMAX model using the statsmodels library in python.

Currently, we are following the below pipeline:

Train the model
Make a forecast for one step in the future and use the inbuilt statsmodels functionality, to produce a confidence interval for this mean prediction.
Repeat this process each day to predict the next day.

Our overall aim is to only train the model once per week. This means that we want to produce forecasts for more than 1 step on from the date the model was trained. As we will be doing each prediction, after the actual data for the day before has been collected, we want to use the parameters from SARIMAX to produce a prediction. This step is fairly straightforward, you just need to use the coefficents and multiply them by the first level difference for the lag components.

However I am struggling to then recreate the confidence intervals for my predictions, does anyone have any idea on the maths that statsmodels uses to create these or the maths that I should be using?

Thanks in advance, James

Solution

Probably the easiest thing to do is to let Statsmodels compute them for you, but using the trained parameters with the new dataset. Here's an example:

First, train the model (your step 1) and save the parameters so that you can use them later

# 1. Train the model
mod = sm.tsa.SARIMAX(endog_train, order=(p, d, q))
res = mod.fit()
# Save the parameters for use later
res.params.to_csv('trained_params.csv')

Then create a new model with the updated dataset, but instead of re-training the parameters (which would use the fit method), apply the old parameters to the new data (using the smooth method).

# 2. Use the saved parameters with a new dataset
mod = sm.tsa.SARIMAX(endog_updated, order=(p, d, q))
params = pd.read_csv('trained_params.csv', index_col=[0])['0']
res = mod.smooth(params)

# Produce forecasts and 95% confidence intervals
fcast = res.get_forecast(steps=5, alpha=0.05)
print(fcast.summary_frame())

For the data I was using, this prints:

realgdp      mean   mean_se  mean_ci_lower  mean_ci_upper
2009Q4   1.682724  3.717418      -5.603282       8.968729
2010Q1   1.031580  4.360360      -7.514567       9.577728
2010Q2   0.632402  4.578709      -8.341702       9.606506
2010Q3   0.387689  4.658123      -8.742065       9.517443
2010Q4   0.237670  4.687621      -8.949899       9.425239

You can do this as many times as you want, and then when it's time to re-train the model, just go back to #1, above.