I am working on a time series analysis project to predict stock prices, and I am running into a problem with get_forecast(). get_prediction() works just fine, and the model fits successfully:
mod = sm.tsa.statespace.SARIMAX(ts_log, order=(1, 1, 1))
results = mod.fit()
print(results.fittedvalues)
>>> output: closing stock prices for all the dates in the observations (last date being '2020-05-18')
Then, when I try to use get_forecast() (cf. code below), instead of the first 10 'out-of-sample' forecasts, the output contains 10 'in-sample' forecasts:
pred2 = results.get_forecast(steps=10)
pred2_ci = pred2.conf_int()
print(pred2_ci)
>>> output: 10 'in-sample' forecasts starting at date '2018-11-26', instead of 10 'out-of-sample' forecasts starting at date '2020-05-19'....
Full code here (except plotting functions & lines):
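import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.tsa.arima_model import ARIMA  # presumably the legacy ARIMA class, since fit(disp=...) is used below
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import acf, pacf
Main = pd.read_csv("StockPrices.csv")
Main["Date2"] = pd.to_datetime(Main["Date"])  # create 'Date2' before 'Date' is dropped and 'Date2' is used as the index
Main1 = Main.drop(['Open', 'High', 'Low', 'Volume', 'Turnover', 'Date'], axis=1)
Main2 = Main1.loc[Main1['Equity'] == 'ABN AMRO Bank'].copy()  # copy so the in-place edits below act on an independent frame
Main2.set_index('Date2', inplace=True)
Main2.index = pd.DatetimeIndex(Main2.index).to_period('D')
Main2 = Main2.drop(['Equity'], axis=1)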
ts = Main2['Last']
ts_log = np.log(ts)
moving_avg = pd.Series(ts_log).rolling(12).mean()
ts_log_moving_avg_diff = ts_log - moving_avg
ts_log_moving_avg_diff.dropna(inplace=True)
ts_log_diff = ts_log - ts_log.shift()
ts_log_diff.dropna(inplace=True)
decomposition = seasonal_decompose(ts_log, freq=70)
trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid
ts_log_decompose = residual
ts_log_decompose_pd = pd.Series(ts_log_decompose)
ts_log_decompose_pd.dropna(inplace=True)
lag_acf = acf(ts_log_diff, nlags=20)
lag_pacf = pacf(ts_log_diff, nlags=20, method='ols')
model = ARIMA(ts_log, order=(1, 1, 0))
results_AR = model.fit(disp=-1)
model = ARIMA(ts_log, order=(0, 1, 1))
results_MA = model.fit(disp=-1)
model = ARIMA(ts_log, order=(1, 1, 1))
results_ARIMA = model.fit(disp=-1)
predictions_ARIMA_diff = pd.Series(results_ARIMA.fittedvalues, copy=True)
predictions_ARIMA_diff_cumsum = predictions_ARIMA_diff.cumsum()
predictions_ARIMA_log = pd.Series(ts_log.iloc[0], index=ts_log.index)
predictions_ARIMA_log = predictions_ARIMA_log.add(predictions_ARIMA_diff_cumsum, fill_value=0)
predictions_ARIMA = np.exp(predictions_ARIMA_log)
mod = sm.tsa.statespace.SARIMAX(ts_log, order=(1, 1, 1))
results = mod.fit()
pred2 = results.get_forecast(steps=10)
pred2_ci = pred2.conf_int()
print(pred2_ci)  # this is the problematic line: the output gives in-sample forecasts instead of out-of-sample forecasts, for an unknown reason
This issue has been driving me insane for a whole week now: the sample ends in May 2020, but the first supposedly out-of-sample forecast is in 2018!
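This looks like a bug in the handling of data with period indexes that have gaps (here, the non-trading days missing from a daily index). SARIMAX can handle missing values, so as a workaround you can reindex the series onto a complete range before fitting the model: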
new_index = pd.period_range(ts_log.index[0], ts_log.index[-1], freq=ts_log.index.freq)
ts_log = ts_log.reindex(new_index)
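To see what that reindexing does, here is a minimal sketch on synthetic data. The series below is only a stand-in for ts_log (weekdays in May 2020 with made-up values), and the names weekdays, full_index and ts_full are mine, not from the question. It first checks whether the daily index has gaps, then fills them with NaN rows:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for ts_log: weekdays only, so the daily period index has gaps.
days = pd.period_range("2020-05-04", "2020-05-18", freq="D")
weekdays = days[days.dayofweek < 5]
ts_log = pd.Series(np.log(np.linspace(10.0, 11.0, len(weekdays))), index=weekdays)

# Gap check: a gap-free index has exactly as many entries as the full range.
full_index = pd.period_range(ts_log.index[0], ts_log.index[-1], freq=ts_log.index.freq)
has_gaps = len(ts_log) < len(full_index)

# Reindex onto the full range; the missing days become NaN observations.
ts_full = ts_log.reindex(full_index)
n_missing = int(ts_full.isna().sum())

print(has_gaps, len(ts_log), len(ts_full), n_missing)
```

After this step the index is contiguous from the first to the last observation, and the NaN rows are treated as missing values when the model is fitted on ts_full.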
I would also think you'd want to use freq='B' throughout instead of freq='D', since stock data is usually only defined at a business-day frequency.
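A rough sketch of the business-day variant, again on synthetic data. Two assumptions of mine to flag: the trading days are made up (2020-05-08 is left out to mimic a market holiday), and I use a DatetimeIndex with asfreq("B") rather than the period index from the question, because as far as I know recent pandas versions warn that business-day periods are deprecated:

```python
import numpy as np
import pandas as pd

# Hypothetical trading days; 2020-05-08 is omitted to mimic a market holiday.
trading_days = pd.to_datetime([
    "2020-05-04", "2020-05-05", "2020-05-06", "2020-05-07",
    "2020-05-11", "2020-05-12", "2020-05-13", "2020-05-14", "2020-05-15",
    "2020-05-18",
])
ts_log = pd.Series(np.log(np.linspace(10.0, 11.0, len(trading_days))), index=trading_days)

# Put the series on a regular business-day grid; the holiday becomes NaN,
# while weekends are simply not part of the grid.
ts_b = ts_log.asfreq("B")
n_missing = int(ts_b.isna().sum())

# The date the first out-of-sample forecast should carry.
next_day = ts_b.index[-1] + ts_b.index.freq

print(len(ts_b), n_missing, next_day.date())
```

With a business-day grid only holidays end up as missing values, so far fewer NaN rows are introduced than with freq='D', and the first forecast date falls on the next trading day rather than on a weekend.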