Search code examples
pythonarimaforecast

Python Auto ARIMA model not working correctly


I've created a pandas DataFrame with valid DatetimeIndex

df.index = df.timestamp
df = df.resample("10Min", how="mean")
plt.plot_date(df.index, df['delay'])
fig = plt.gcf()
fig.set_size_inches(18.5, 8.5)

This is how it looks like:

Timeseries

The relevant attributes for the model fitting:

df['delay'].head(5)

timestamp  
2016-10-30 04:30:00    32.000000  
2016-10-30 04:40:00    12.714286  
2016-10-30 04:50:00    36.941176  
2016-10-30 05:00:00    37.273381  
2016-10-30 05:10:00    38.960526  
Name: delay, dtype: float64

I then fitted ARIMA to the data:

import pmdarima as pm
import numpy as np
import matplotlib.pyplot as plt

df = df.dropna()

model = pm.auto_arima(df.delay, error_action='ignore', trace=1,
                      suppress_warnings=True,
                      seasonal=True, m=12)

model.plot_diagnostics(figsize=(7,5))
plt.show()

With diagnostic results:

Fit ARIMA: order=(2, 0, 2) seasonal_order=(1, 0, 1, 12); AIC=15089.595, BIC=15133.343, Fit time=4.145 seconds
Fit ARIMA: order=(0, 0, 0) seasonal_order=(0, 0, 0, 12); AIC=17785.720, BIC=17796.657, Fit time=0.026 seconds
Fit ARIMA: order=(1, 0, 0) seasonal_order=(1, 0, 0, 12); AIC=15136.460, BIC=15158.334, Fit time=1.219 seconds
Fit ARIMA: order=(0, 0, 1) seasonal_order=(0, 0, 1, 12); AIC=16256.966, BIC=16278.840, Fit time=1.508 seconds
Fit ARIMA: order=(0, 0, 0) seasonal_order=(0, 0, 0, 12); AIC=20520.379, BIC=20525.847, Fit time=0.020 seconds
Fit ARIMA: order=(2, 0, 2) seasonal_order=(0, 0, 1, 12); AIC=15087.594, BIC=15125.874, Fit time=3.259 seconds
Fit ARIMA: order=(2, 0, 2) seasonal_order=(0, 0, 0, 12); AIC=15085.811, BIC=15118.622, Fit time=0.757 seconds
Fit ARIMA: order=(2, 0, 2) seasonal_order=(1, 0, 0, 12); AIC=15087.595, BIC=15125.874, Fit time=3.221 seconds
Fit ARIMA: order=(1, 0, 2) seasonal_order=(0, 0, 0, 12); AIC=15083.914, BIC=15111.257, Fit time=0.566 seconds
Fit ARIMA: order=(1, 0, 2) seasonal_order=(1, 0, 0, 12); AIC=15085.685, BIC=15118.496, Fit time=2.917 seconds
Fit ARIMA: order=(1, 0, 2) seasonal_order=(0, 0, 1, 12); AIC=15085.684, BIC=15118.495, Fit time=2.064 seconds
Fit ARIMA: order=(1, 0, 2) seasonal_order=(1, 0, 1, 12); AIC=15087.685, BIC=15125.965, Fit time=3.655 seconds
Fit ARIMA: order=(0, 0, 2) seasonal_order=(0, 0, 0, 12); AIC=15765.080, BIC=15786.954, Fit time=0.538 seconds
Fit ARIMA: order=(1, 0, 1) seasonal_order=(0, 0, 0, 12); AIC=15127.434, BIC=15149.308, Fit time=0.252 seconds
Fit ARIMA: order=(1, 0, 3) seasonal_order=(0, 0, 0, 12); AIC=15085.728, BIC=15118.539, Fit time=0.772 seconds
Fit ARIMA: order=(0, 0, 1) seasonal_order=(0, 0, 0, 12); AIC=16323.047, BIC=16339.452, Fit time=0.275 seconds
Fit ARIMA: order=(0, 0, 3) seasonal_order=(0, 0, 0, 12); AIC=15554.326, BIC=15581.669, Fit time=0.782 seconds
Fit ARIMA: order=(2, 0, 1) seasonal_order=(0, 0, 0, 12); AIC=15108.477, BIC=15135.819, Fit time=0.684 seconds
Fit ARIMA: order=(2, 0, 3) seasonal_order=(0, 0, 0, 12); AIC=15085.457, BIC=15123.737, Fit time=1.764 seconds
Total fit time: 28.444 seconds

ARIMA Diagnostics

Then I do the forecast for 2 days in the future but the ARIMA model is flattening out in a strange fashion:

# Forecast
n_periods = 288
fc, confint = model.predict(n_periods=n_periods, return_conf_int=True)
index_of_fc = np.arange(len(df.delay), len(df.delay)+n_periods)

idx = pd.date_range('2016-11-13 01:20:00', periods=n_periods, freq='10min')

# make series for plotting purpose
fc_series = pd.Series(fc, index=idx)
lower_series = pd.Series(confint[:, 0], index=idx)
upper_series = pd.Series(confint[:, 1], index=idx)

#type(fc_series)
#idx
#type(df.index)
# Plot
plt.plot(df.delay)
plt.plot(fc_series, color='darkgreen')
plt.fill_between(lower_series.index, 
                 lower_series, 
                 upper_series, 
                 color='k', alpha=.15)

plt.title("Forecast of delays with 2 days future horizon")
fig = plt.gcf()
fig.set_size_inches(18.5, 8.5)
plt.show()

It looks like this:

Arima Delay Forecast

The Forecasted Series flats out at ~76

fc_series.describe()

count    240.000000
mean      86.422551
std       30.717400
min       76.344097
25%       76.344159
50%       76.353180
75%       77.662985
max      303.833528
dtype: float64

Here's a graphical description of the forecasted series:

fc_series.plot()

Forecasted Series

Does anyone know what I am doing wrong? I've tried using many of the auto_arima parameters to tune the model but it always flats out like this.


Solution

  • I had no luck with auto_arima implementation. I've fitted ARIMA to my timeseries with statsmodels ARIMA implementation.

    MSE: 715.12

    enter image description here

    Full code:

    import warnings
    import itertools
    import pandas as pd
    import numpy as np
    import statsmodels.api as sm
    import matplotlib.pyplot as plt
    plt.style.use('fivethirtyeight')
    
    #data = sm.datasets.co2.load_pandas()
    #y = data.data
    
    y = df
    y.drop(y.columns.difference(['delay']), 1, inplace=True)
    
    # The 'MS' string groups the data in buckets by start of the month
    y = y['delay'].resample("4H", how="mean")
    
    # The term bfill means that we use the value before filling in missing values
    y = y.fillna(y.bfill())
    
    print(y)
    
    y.plot(figsize=(15, 6))
    plt.show()
    
    # Define the p, d and q parameters to take any value between 0 and 2
    p = d = q = range(0, 2)
    
    # Generate all different combinations of p, q and q triplets
    pdq = list(itertools.product(p, d, q))
    
    # Generate all different combinations of seasonal p, q and q triplets
    seasonal_pdq = [(x[0], x[1], x[2], 12) for x in list(itertools.product(p, d, q))]
    
    print('Examples of parameter combinations for Seasonal ARIMA...')
    print('SARIMAX: {} x {}'.format(pdq[1], seasonal_pdq[1]))
    print('SARIMAX: {} x {}'.format(pdq[1], seasonal_pdq[2]))
    print('SARIMAX: {} x {}'.format(pdq[2], seasonal_pdq[3]))
    print('SARIMAX: {} x {}'.format(pdq[2], seasonal_pdq[4]))
    
    warnings.filterwarnings("ignore") # specify to ignore warning messages
    
    for param in pdq:
        for param_seasonal in seasonal_pdq:
            try:
                mod = sm.tsa.statespace.SARIMAX(y,
                                                order=param,
                                                seasonal_order=param_seasonal,
                                                enforce_stationarity=False,
                                                enforce_invertibility=False)
    
                results = mod.fit()
    
                print('ARIMA{}x{}12 - AIC:{}'.format(param, param_seasonal, results.aic))
            except:
                continue
    
    mod = sm.tsa.statespace.SARIMAX(y,
                                    order=(1, 1, 1),
                                    seasonal_order=(1, 1, 1, 12),
                                    enforce_stationarity=False,
                                    enforce_invertibility=False)
    
    results = mod.fit()
    
    print(results.summary().tables[1])
    
    results.plot_diagnostics(figsize=(15, 12))
    plt.show()
    pred = results.get_prediction(start=pd.to_datetime('2016-11-20 00:00:00'), dynamic=False)
    pred_ci = pred.conf_int()
    
    ax = y['2016-10-30 04:00:00':].plot(label='observed')
    pred.predicted_mean.plot(ax=ax, label='One-step ahead Forecast', alpha=.7)
    
    ax.set_xlabel('Date')
    ax.set_ylabel('Delays')
    plt.legend()
    
    plt.show()
    
    y_forecasted = pred.predicted_mean
    y_truth = y['2016-11-20 00:00:00':]
    
    # Compute the mean square error
    mse = ((y_forecasted - y_truth) ** 2).mean()
    print('The Mean Squared Error of our forecasts is {}'.format(round(mse, 2)))
    
    pred_dynamic = results.get_prediction(start=pd.to_datetime('2016-11-20 00:00:00'), dynamic=True, full_results=True)
    pred_dynamic_ci = pred_dynamic.conf_int()
    
    ax = y['2016-10-30 04:00:00':].plot(label='observed', figsize=(20, 15))
    pred_dynamic.predicted_mean.plot(label='Dynamic Forecast', ax=ax)
    
    ax.fill_betweenx(ax.get_ylim(), pd.to_datetime('2016-11-20 00:00:00'), y.index[-1],
                     alpha=.1, zorder=-1)
    
    ax.set_xlabel('Date')
    ax.set_ylabel('Delays')
    
    plt.legend()
    plt.show()
    
    y_forecasted = pred_dynamic.predicted_mean
    y_truth = y['2016-11-20 00:00:00':]
    
    # Compute the mean square error
    mse = ((y_forecasted - y_truth) ** 2).mean()
    print('The Mean Squared Error of our forecasts is {}'.format(round(mse, 2)))
    
    # Get forecast 500 steps ahead in future
    pred_uc = results.get_forecast(steps=32)
    
    # Get confidence intervals of forecasts
    pred_ci = pred_uc.conf_int()
    print(pred_ci.iloc[:, 1])
    
    ax = y.plot(label='observed', figsize=(20, 15))
    pred_uc.predicted_mean.plot(ax=ax, label='Forecast')
    ax.fill_between(pred_ci.index,
                    pred_ci.iloc[:, 0],
                    pred_ci.iloc[:, 1], color='k', alpha=.25)
    ax.set_xlabel('Date')
    ax.set_ylabel('Delays')
    
    plt.legend()
    plt.show()