python machine-learning time-series forecasting

Can I make out-of-sample predictions with a regression model?


Following a guide, I've built an sklearn regression model for time series forecasting. I can use the model to get predictions on a set of test data where I have both the timestamps and the independent variable data, since the model just takes those variables and returns the output labels as predictions.

However, I'm not sure how, or even whether, I can use this model for out-of-sample predictions, where I only have a future timestamp and none of the independent variable data that goes with it. Is there some sort of recursive method where the model can use data from a test set, make a prediction, then use that prediction together with the data to make the next prediction, and so on? Thanks!


Solution

  • Yes, but it depends on whether you want to do single-step or multi-step forecasts.

    For single-step forecasts, as you describe, pass the last available window of your data to the prediction function; it returns the forecast for the first step ahead.
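
    As a rough illustration, a single-step forecast could look like the sketch below. It assumes the model is trained only on lagged values of the target (so no future independent variables are needed), uses `LinearRegression` as a stand-in for whatever regressor you fitted, and runs on a toy linear series. The helper `make_windows` is hypothetical, not part of sklearn.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def make_windows(series, window):
    """Hypothetical helper: each row of X holds `window` consecutive values,
    and y holds the value that immediately follows that window."""
    series = np.asarray(series, dtype=float)
    X = np.array([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X, y


# Toy data: a noiseless linear trend, so the forecast is easy to sanity-check.
series = np.arange(20, dtype=float)  # 0, 1, ..., 19
window = 3

X, y = make_windows(series, window)
model = LinearRegression().fit(X, y)

# The last available window is the input for the first unseen time step.
last_window = series[-window:].reshape(1, -1)
one_step_forecast = model.predict(last_window)[0]
print(one_step_forecast)
```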

    For multi-step forecasts, you have three options:

    • Direct: Fit one regressor for each step ahead and let each fitted regressor make its prediction from the last available window.
    • Recursive: Use the last available window to make the first-step prediction, then use that prediction to roll the window forward and predict again.
    • DirRec: A combination of the two strategies above, where instead of rolling the window you expand it with the previously predicted value. Note that this requires fitting the regressors accordingly, since each step's regressor sees a longer input.
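
    The three strategies above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the features are only lagged values of the target, `LinearRegression` stands in for any sklearn regressor, the data is a toy linear trend, and the function names (`make_windows`, `recursive_forecast`, `direct_forecast`, `dirrec_forecast`) are made up for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def make_windows(series, window, step=1):
    """Rows of X are `window` consecutive values; y is the value `step`
    positions after the end of each window."""
    series = np.asarray(series, dtype=float)
    n_rows = len(series) - window - step + 1
    X = np.array([series[i:i + window] for i in range(n_rows)])
    y = series[window + step - 1:]
    return X, y


def recursive_forecast(model, last_window, horizon):
    """One fitted one-step model; each prediction is fed back in as input."""
    current = np.asarray(last_window, dtype=float).copy()
    forecasts = []
    for _ in range(horizon):
        next_value = model.predict(current.reshape(1, -1))[0]
        forecasts.append(next_value)
        # Roll the window: drop the oldest value, append the new prediction.
        current = np.append(current[1:], next_value)
    return np.array(forecasts)


def direct_forecast(series, window, horizon):
    """One model per step ahead; every model sees the same last window."""
    last_window = np.asarray(series[-window:], dtype=float).reshape(1, -1)
    forecasts = []
    for step in range(1, horizon + 1):
        X, y = make_windows(series, window, step=step)
        model_for_step = LinearRegression().fit(X, y)
        forecasts.append(model_for_step.predict(last_window)[0])
    return np.array(forecasts)


def dirrec_forecast(series, window, horizon):
    """One model per step; the input grows by one value at every step."""
    inputs = np.asarray(series[-window:], dtype=float).copy()
    forecasts = []
    for extra in range(horizon):
        # Train on windows that are `extra` values longer than the base window.
        X, y = make_windows(series, window + extra)
        model_for_step = LinearRegression().fit(X, y)
        next_value = model_for_step.predict(inputs.reshape(1, -1))[0]
        forecasts.append(next_value)
        # Expand (rather than roll) the input with the new prediction.
        inputs = np.append(inputs, next_value)
    return np.array(forecasts)


series = np.arange(30, dtype=float)  # toy linear trend: 0, 1, ..., 29
window, horizon = 4, 5

X, y = make_windows(series, window)
one_step_model = LinearRegression().fit(X, y)

recursive = recursive_forecast(one_step_model, series[-window:], horizon)
direct = direct_forecast(series, window, horizon)
dirrec = dirrec_forecast(series, window, horizon)
print("recursive:", recursive)
print("direct:   ", direct)
print("dirrec:   ", dirrec)
```

    On real data the strategies generally disagree: recursive forecasts can accumulate error because predictions are reused as inputs, while direct forecasts need one fitted model per step.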

    You can find more details in:

    Bontempi, Gianluca, Souhaib Ben Taieb, and Yann-Aël Le Borgne. "Machine learning strategies for time series forecasting." European Business Intelligence Summer School. Springer, Berlin, Heidelberg, 2012.

    Also note that you need to evaluate your model carefully. The train and test sets are not independent in this setting, because they contain measurements of the same variable at successive time points, so you have to account for potential autocorrelation.
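
    One common way to respect the time ordering during evaluation is sklearn's `TimeSeriesSplit`, which always trains on earlier rows and tests on later ones. The sketch below is only an illustration on a synthetic random walk with lagged-value features (both are assumptions, not your data), and it does not by itself remove autocorrelation between neighbouring folds.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit

# Synthetic random walk standing in for a real series (assumption).
rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=200))

# Lagged-value features: each row is a window, the target is the next value.
window = 5
X = np.array([series[i:i + window] for i in range(len(series) - window)])
y = series[window:]

splitter = TimeSeriesSplit(n_splits=5)
fold_errors = []
for train_idx, test_idx in splitter.split(X):
    # Every training row comes before every test row, so the model is
    # never fitted on observations from the future of its test fold.
    assert train_idx.max() < test_idx.min()
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    predictions = model.predict(X[test_idx])
    fold_errors.append(mean_absolute_error(y[test_idx], predictions))

print(fold_errors)
```

    Because adjacent windows share values, you may also want to leave a buffer between the training and test rows; `TimeSeriesSplit` has a `gap` parameter for that purpose.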