Search code examples
pythondata-sciencepredictionarimapmdarima

Auto ARIMA in Python results in poor fitting prediction of trend


New to ARIMA and attempting to model a dataset in Python using auto ARIMA. I'm using auto-ARIMA as I believe it will be better at defining the values of p, d and q however the results are poor and I need some guidance. Please see my reproducible attempts below

Attempt as follows:

    # DEPENDENCIES
    import pandas as pd 
    import numpy as np 
    import matplotlib.pyplot as plt
    import pmdarima as pm 
    from pmdarima.model_selection import train_test_split 
    from statsmodels.tsa.stattools import adfuller
    from pmdarima.arima import ADFTest
    from pmdarima import auto_arima
    from sklearn.metrics import r2_score 

# CREATE DATA
data_plot = pd.DataFrame(data removed)

# SET INDEX
data_plot['date_index'] = pd.to_datetime(data_plot['date']
data_plot.set_index('date_index', inplace=True)

# CREATE ARIMA DATASET
arima_data = data_plot[['value']]
arima_data

# PLOT DATA
arima_data['value'].plot(figsize=(7,4))

The above steps result in a dataset that should look like this. enter image description here

# Dicky Fuller test for stationarity 
adf_test = ADFTest(alpha = 0.05)
adf_test.should_diff(arima_data)

Result = 0.9867 indicating non-stationary data which should be handled by appropriate over of differencing later in auto arima process.

# Assign training and test subsets - 80:20 split 

print('Dataset dimensions;', arima_data.shape)
train_data = arima_data[:-24]
test_data = arima_data[-24:]
print('Training data dimension:', train_data.shape, round((len(train_data)/len(arima_data)*100),2),'% of dataset')
print('Test data dimension:', test_data.shape, round((len(train_data)/len(arima_data)*100),2),'% of dataset')

# Plot training & test data
plt.plot(train_data)
plt.plot(test_data)

enter image description here

 # Run auto arima
    arima_model = auto_arima(train_data, start_p=0, d=1, start_q=0,
    max_p=5, max_d=5, max_q=5,
    start_P=0, D=1, start_Q=0, max_P=5, max_D=5,
    max_Q=5, m=12, seasonal=True,
    stationary=False,
    error_action='warn', trace=True,
    suppress_warnings=True, stepwise=True,
    random_state=20, n_fits=50)
        
    print(arima_model.aic())

Output suggests best model is 'ARIMA(1,1,1)(0,1,0)[12]' with AIC 1725.35484

#Store predicted values and view resultant df

prediction = pd.DataFrame(arima_model.predict(n_periods=25), index=test_data.index)
prediction.columns = ['predicted_value']
prediction

# Plot prediction against test and training trends 

plt.figure(figsize=(7,4))
plt.plot(train_data, label="Training")
plt.plot(test_data, label="Test")
plt.plot(prediction, label="Predicted")
plt.legend(loc='upper right')
plt.show()

enter image description here

# Finding r2 model score
    test_data['predicted_value'] = prediction 
    r2_score(test_data['value'], test_data['predicted_value'])

Result: -6.985


Solution

  • Is auto_arima a method done by you? It depends how you differentiate and what you do there. Did you check the autocorrelation and partial autocorrelation to know which repeating time lags you have there?

    Also, it seems you have some seasonality patterns every year, you could try a SARIMA model if you are not doing it already.

    To try a SARIMA model you have to:

    1. Stationarized the data, in this case by differentiation you can convert the moving mean a stationary one.
    data_stationarized = train_data.diff()[1:]
    
    1. Check the autocorrelation and partial autocorrelation to check the seasonality. You can use the library statsmodels for this.
    import statsmodels.api as sm
    sm.graphics.tsa.plot_acf(data_stationarized);
    

    enter image description here

    You can see that the most prominent flag is the twelfth flag, so as the granularity of the data is by month, that means there is prominent seasonality pattern every 12 months.

    We can check the partial autocorrelation to confirm it too:

    sm.graphics.tsa.plot_pacf(data_stationarized);
    

    enter image description here

    Again the most prominent flag is the twelfth one.

    1. Fit the model with a seasonality order of 12. There are more parameters to explain which can be adjusted to have better results, but then this post will be very long.
    model = sm.tsa.SARIMAX(endog=train_data, order=(2,0,0), seasonal_order=(2,0,0,12))
    model_fit = model.fit()
    
    1. Evaluate the results
    from sklearn.metrics import mean_squared_error
    
    y_pred = model_fit.forecast(steps=24)
    
    # when squared=False then is equals to RMSE
    mean_squared_error(y_true=test_data.values, y_pred=y_pred, squared=False)
    

    This outputs 12063.88, which you can use to compare different results more rigorously.

    For a graphical check:

    prediction = pd.DataFrame(model_fit.forecast(steps=25), index=test_data.index)
    prediction.columns = ['predicted_value']
    prediction
    
    # Plot prediction against test and training trends
    
    plt.figure(figsize=(7,4))
    plt.plot(train_data, label="Training")
    plt.plot(test_data, label="Test")
    plt.plot(prediction, label="Predicted")
    plt.legend(loc='upper right')
    plt.xticks([])
    plt.yticks([])
    plt.show();
    

    enter image description here

    Now you can see that the predictions get closer to the expected values.

    You could continue fine tuning the order and seasonal order to get even better results, I will advice to check the docs of statsmodel.

    Another advice it's to analyze the autocorrelation and partial autocorrelation of the residuals to check if your model is capturing all of the patterns. You have them in the model_fit object.