time-series prediction statsmodels arima forecast

SARIMAX get_forecast() not working as expected


I am working on a time series analysis project to predict stock prices, and I am running into an issue with get_forecast(). get_prediction() works just fine, and the model fits successfully:

mod = sm.tsa.statespace.SARIMAX(ts_log, order=(1, 1, 1))
results = mod.fit()

print(results.fittedvalues)  # output: closing stock prices for all the dates in the observations (last date being '2020-05-18')

Then, when I call get_forecast() (see the code below), the output contains 10 'in-sample' forecasts instead of the first 10 'out-of-sample' forecasts:

pred2 = results.get_forecast(steps=10)   
pred2_ci = pred2.conf_int()

print(pred2_ci)  # output: 10 'in-sample' forecasts starting at date '2018-11-26', instead of 10 'out-of-sample' forecasts starting at date '2020-05-19'

Full code here (except plotting functions & lines):

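import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.tsa.arima_model import ARIMA  # legacy ARIMA API (removed in statsmodels 0.13); fit(disp=-1) below relies on it
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import acf, pacf

Main = pd.read_csv("StockPrices.csv")
Main["Date2"] = pd.to_datetime(Main["Date"])  # create Date2 before it is used as the index below
Main1 = Main.drop(['Open', 'High', 'Low', 'Volume', 'Turnover', 'Date'], axis=1)
Main2 = Main1.loc[Main1['Equity'] == 'ABN AMRO Bank'].copy()  # copy so set_index does not act on a slice
Main2.set_index('Date2', inplace=True)
Main2.index = pd.DatetimeIndex(Main2.index).to_period('D')
Main2 = Main2.drop(['Equity'], axis=1)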

ts = Main2['Last']
ts_log = np.log(ts)
moving_avg = pd.Series(ts_log).rolling(12).mean()

ts_log_moving_avg_diff = ts_log - moving_avg
ts_log_moving_avg_diff.dropna(inplace=True)
ts_log_diff = ts_log - ts_log.shift()
ts_log_diff.dropna(inplace=True)

decomposition = seasonal_decompose(ts_log, freq=70)
trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid

ts_log_decompose = residual
ts_log_decompose_pd = pd.Series(ts_log_decompose)
ts_log_decompose_pd.dropna(inplace=True)

lag_acf = acf(ts_log_diff, nlags=20)
lag_pacf = pacf(ts_log_diff, nlags=20, method='ols')

model = ARIMA(ts_log, order=(1, 1, 0))
results_AR = model.fit(disp=-1)

model = ARIMA(ts_log, order=(0, 1, 1))
results_MA = model.fit(disp=-1)

model = ARIMA(ts_log, order=(1, 1, 1))
results_ARIMA = model.fit(disp=-1)

predictions_ARIMA_diff = pd.Series(results_ARIMA.fittedvalues, copy=True)
predictions_ARIMA_diff_cumsum = predictions_ARIMA_diff.cumsum()
predictions_ARIMA_log = pd.Series(ts_log.iloc[0], index=ts_log.index)
predictions_ARIMA_log = predictions_ARIMA_log.add(predictions_ARIMA_diff_cumsum, fill_value=0)
predictions_ARIMA = np.exp(predictions_ARIMA_log)

mod = sm.tsa.statespace.SARIMAX(ts_log, order=(1, 1, 1)) 
results = mod.fit()

pred2 = results.get_forecast(steps=10)
pred2_ci = pred2.conf_int() 

print(pred2_ci)  # this is the problematic line: the output gives in-sample forecasts instead of out-of-sample forecasts, for an unknown reason

This issue has been driving me insane for a whole week now: the sample ends in May 2020, but the first supposedly out-of-sample forecast is in 2018! (The original post linked three screenshots of the output here.)


Solution

  • This looks like a bug in the handling of data whose period index has gaps. SARIMAX can handle missing values, so you can start by reindexing onto a complete range:

    new_index = pd.period_range(ts_log.index[0], ts_log.index[-1], freq=ts_log.index.freq)
    ts_log = ts_log.reindex(new_index)
    

    You would probably also want to use freq='B' throughout instead of freq='D', since stock data is usually only defined at business-day frequency.