I am currently trying to implement both direct and recursive multi-step forecasting strategies using the statsmodels ARIMA library and it has raised a few questions.
A recursive multi-step forecasting strategy would be training a one-step model, predicting the next value, appending the predicted value onto the end of my exogenous values fed into the forecast method and repeating. This is my recursive implementation:
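import numpy as np
from statsmodels.tsa.arima_model import ARIMA  # legacy API; newer releases use statsmodels.tsa.arima.model.ARIMA

def arima_forecast_recursive(history, horizon=1, config=None):
    # make list so can add / remove elements
    history = history.tolist()
    predictions = []
    for _ in range(horizon):
        # refit on the observed values plus the forecasts made so far
        model = ARIMA(history, order=config)
        model_fit = model.fit(trend='nc', disp=0)
        # forecast() returns (forecast, stderr, conf_int); keep the point forecast
        yhat = model_fit.forecast(steps=1)[0][0]
        predictions.append(yhat)
        # feed the prediction back in as if it had been observed
        history.append(yhat)
    return np.array(predictions)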
def walk_forward_validation(dataframe, config=None):
    n_train = 52 # Give a minimum of 2 forecasting periods to capture any seasonality
    n_test = 26 # Test set should be the size of one forecasting horizon
    n_records = len(dataframe)
    tuple_list = []
    for index, i in enumerate(range(n_train, n_records)):
        # create the train-test split
        train, test = dataframe[0:i], dataframe[i:i + n_test]
        # Test set is less than forecasting horizon so stop here.
        if len(test) < n_test:
            break
        yhat = arima_forecast_recursive(train, n_test, config)
        results = smape3(test, yhat)
        tuple_list.append(results)
    return tuple_list
Similarly, to perform a direct strategy I would just fit my model on the available training data and use it to predict the whole multi-step forecast at once. I am not sure how to achieve this using the statsmodels library.
My attempt (which produces results) is below:
def walk_forward_validation(dataframe, config=None):
    # This currently implements a direct forecasting strategy
    n_train = 52 # Give a minimum of 2 forecasting periods to capture any seasonality
    n_test = 26 # Test set should be the size of one forecasting horizon
    n_records = len(dataframe)
    tuple_list = []
    for index, i in enumerate(range(n_train, n_records)):
        # create the train-test split
        train, test = dataframe[0:i], dataframe[i:i + n_test]
        # Test set is less than forecasting horizon so stop here.
        if len(test) < n_test:
            break
        yhat = arima_forecast_direct(train, n_test, config)
        results = smape3(test, yhat)
        tuple_list.append(results)
    return tuple_list

def arima_forecast_direct(history, horizon=1, config=None):
    model = ARIMA(history, order=config)
    model_fit = model.fit(trend='nc', disp=0)
    return model_fit.forecast(steps=horizon)[0]
What confuses me specifically is whether the model should be fit once for all predictions, or fit multiple times, once for each prediction in the multi-step forecast. According to Souhaib Ben Taieb's doctoral thesis (page 35, paragraph 3), the direct strategy estimates H models, where H is the length of the forecast horizon, so in my example with a forecast horizon of 26, 26 models should be estimated instead of just one. As shown above, my current implementation only fits one model.
What I do not understand is how calling the ARIMA.fit() method multiple times on the same training data would give me fits that differ by anything more than the expected normal stochastic variation.
My final question is with regard to optimisation. Using a method such as walk forward validation gives me statistically very significant results, but for many time series it is very computationally expensive. Both of the above implementations are already called using the joblib parallel loop execution functionality, which significantly reduced the runtime on my laptop. However, I would like to know if there is anything that can be done with regard to the above implementations to make them even more efficient. When running these methods for ~2000 separate time series (~500,000 data points in total across all series), the runtime is 10 hours. I have profiled the code and most of the execution time is spent in the statsmodels library, which is fine, but there is a discrepancy between the runtime of the walk_forward_validation() method and that of ARIMA.fit(). This is expected, as the walk_forward_validation() method obviously does more than just call the fit method, but if anything in it can be changed to speed up execution then please let me know.
The idea of this code is to find an optimal ARIMA order per time series, as it isn't feasible to investigate 2000 time series individually. The walk_forward_validation() method is therefore called 27 times per time series, so roughly 54,000 times overall (27 × 2000). Any performance saving that can be found within this method will have an impact, no matter how small it is.
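To put the cost in perspective, here is a rough count of how many model fits the loop above performs. It is only a back-of-the-envelope sketch: it assumes all series have about the same length (~500,000 points / ~2000 series, so about 250 each), and the `step` parameter is a hypothetical addition that is not in the code above, shown only to illustrate how moving the forecast origin by more than one record at a time would change the count.

```python
def n_walk_forward_fits(n_records, n_train=52, n_test=26, step=1):
    """Count the train/test splits (one model fit each) for a single series.

    Mirrors the loop in walk_forward_validation(): the origin i starts at
    n_train and the loop stops once a full test window no longer fits.
    `step` is a hypothetical stride; the original loop uses step=1.
    """
    last_origin = n_records - n_test  # last i that still has a full test window
    if last_origin < n_train:
        return 0
    return len(range(n_train, last_origin + 1, step))


# Assumption: every series has roughly 250 records.
fits_per_call = n_walk_forward_fits(250)        # 173 splits per call
total_fits = fits_per_call * 27 * 2000          # 27 orders, ~2000 series
fits_with_stride = n_walk_forward_fits(250, step=13)

print(fits_per_call, total_fits, fits_with_stride)
```

Under these assumptions the fit count, rather than the bookkeeping inside walk_forward_validation(), is what dominates, so evaluating fewer forecast origins is likely to save more than micro-optimising the loop body.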
Normally, ARIMA can only perform recursive forecasting, not direct forecasting. There might be some research on variations of ARIMA for direct forecasting, but they wouldn't be implemented in statsmodels. In statsmodels (or in R's auto.arima()), when you set a value for h > 1, it simply performs a recursive forecast to get there.
As far as I know, none of the standard forecasting libraries have direct forecasting implemented yet, so you're going to have to code it yourself.
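To see why a multi-step forecast from a single fitted model is already recursive, consider a toy zero-mean AR(1) process with a known coefficient. This is a sketch in plain NumPy, not statsmodels code; `recursive_forecast` and the values of `phi` and `last_value` are made up for illustration. Iterating the one-step forecast and feeding each prediction back in gives the same numbers as the closed-form h-step forecast `phi**h * last_value`.

```python
import numpy as np


def recursive_forecast(last_value, phi, horizon):
    """Iterate a one-step AR(1) forecast, feeding each prediction back in."""
    forecasts = []
    current = last_value
    for _ in range(horizon):
        current = phi * current    # one-step-ahead forecast
        forecasts.append(current)  # the prediction becomes the next input
    return np.array(forecasts)


phi, last_value, horizon = 0.8, 10.0, 5  # illustrative values only
recursive = recursive_forecast(last_value, phi, horizon)

# Closed-form h-step-ahead forecast of a zero-mean AR(1): phi**h * y_t
closed_form = last_value * phi ** np.arange(1, horizon + 1)

print(np.allclose(recursive, closed_form))
```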
> Taken from Souhaib Ben Taieb's doctoral thesis (page 35 paragraph 3) it is presented that direct model will estimate H models, where H is the length of the forecast horizon, so in my example with a forecast horizon of 26, 26 models should be estimated instead of just one.
I haven't read Ben Taieb's thesis, but from his paper "Machine Learning Strategies for Time Series Forecasting", for direct forecasting, there is only one model for one value of H. So for H=26, there will be only one model. There will be H models if you need to forecast for every value between 1 and H, but for one H, there is only one model.
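As a rough illustration of what coding it yourself could look like, here is a minimal sketch of the direct strategy. It swaps ARIMA for a plain autoregressive least-squares model, since that keeps the example self-contained; `fit_direct_models`, `direct_forecast` and `n_lags` are hypothetical names, not part of any library. The point is that each step h gets its own model because the training target is different (the value h steps after the input window), which is why the H fits are not just repeated fits of the same problem.

```python
import numpy as np


def fit_direct_models(series, n_lags, horizon):
    """Fit one least-squares model per step h = 1..horizon (hypothetical helper)."""
    series = np.asarray(series, dtype=float)
    models = []
    for h in range(1, horizon + 1):
        rows, targets = [], []
        # the input window ends at index t; the target lies h steps after it
        for t in range(n_lags - 1, len(series) - h):
            rows.append(series[t - n_lags + 1:t + 1])
            targets.append(series[t + h])
        X = np.column_stack([np.ones(len(rows)), np.array(rows)])
        coef, *_ = np.linalg.lstsq(X, np.array(targets), rcond=None)
        models.append(coef)
    return models


def direct_forecast(series, models, n_lags):
    """Forecast every step from the same last window, one model per step."""
    window = np.asarray(series, dtype=float)[-n_lags:]
    x = np.concatenate([[1.0], window])
    return np.array([x @ coef for coef in models])


rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=200))  # synthetic random walk, illustration only
models = fit_direct_models(series, n_lags=4, horizon=26)
forecast = direct_forecast(series, models, n_lags=4)
print(len(models), forecast.shape)
```

If only the value at a single horizon is needed, only the model for that h has to be fitted, which matches the "one model for one value of H" reading above.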