Tags: python, pandas, forecasting, arima

Python/Pandas - confusion around ARIMA forecasting to get simple predictions


I'm trying to wrap my head around how to implement an ARIMA model to produce (arguably) simple forecasts. Essentially, I want to forecast this year's bookings up to the end of the year and export the result as a csv, looking something like this:

date           bookings
2017-01-01     438
2017-01-02     167
...
2017-12-31     45
2018-01-01     748
...
2018-11-29     223
2018-11-30     98
...
2018-12-30     73
2018-12-31     100

Where anything later than today (28/11/18) is forecasted.

What I've tried to do:

This gives me my dataset, which is basically two columns: the date (daily, for the whole of 2017) and the bookings:

import pandas as pd
import statsmodels.api as sm
from statsmodels.tsa.arima_model import ARIMA  # required: ARIMA is used in the modelling loop
# from sklearn.metrics import mean_squared_error

import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import matplotlib
matplotlib.rcParams['axes.labelsize'] = 14
matplotlib.rcParams['xtick.labelsize'] = 12
matplotlib.rcParams['ytick.labelsize'] = 12
matplotlib.rcParams['text.color'] = 'k'

df = pd.read_csv('data.csv',names = ["date","bookings"],index_col=0)
df.index = pd.to_datetime(df.index)

This is the 'modelling' bit:

X = df.values
size = int(len(X) * 0.66)
train, test = X[0:size], X[size:len(X)]
history = [x for x in train]
predictions = list()
for t in range(len(test)): 
    model = ARIMA(history, order=(1,1,0))
    model_fit = model.fit(disp=0)
    output = model_fit.forecast()

    yhat = output[0]
    predictions.append(yhat) 

    obs = test[t]
    history.append(obs)

    #   print('predicted=%f, expected=%f' % (yhat, obs))
#error = mean_squared_error(test, predictions)
#print(error)
#print('Test MSE: %.3f' % error)
# plot
plt.figure(num=None, figsize=(15, 8))
plt.plot(test)
plt.plot(predictions, color='red')
plt.show()

Exporting results to a csv:

df_forecast = pd.DataFrame(predictions)
df_test = pd.DataFrame(test)
result = pd.merge(df_test, df_forecast, left_index=True, right_index=True)
result.rename(columns = {'0_x': 'Test', '0_y': 'Forecast'}, inplace=True)

The trouble I'm having is:

  • Understanding the train/test subsets. Correct me if I'm wrong but the Train set is used to train the model and produce the 'predictions' data and then the Test is there to compare the predictions against the test?
  • 2017 data looked good, but how do I implement it on 2018 data? How do I get the Train/Test sets? Do I even need it?

What I think I need to do:

  • Grab my bookings dataset of 2017 and 2018 data from my database
  • Split it by 2017 and 2018
  • Produce some forecasts on 2018
  • Append this 2018+forecast data to 2017 and export as csv

The how-to and the why are the problem I'm having. Any help would be much appreciated.


Solution

Here are some thoughts:

    • Understanding the train/test subsets. Correct me if I'm wrong but the Train set is used to train the model and produce the 'predictions' data and then the Test is there to compare the predictions against the test?

    Yes, that is correct. The idea is the same as for any machine learning model: the data is split into train and test sets, a model is fit on the train data, and the test data is used to compare the model's predictions with the real values using some error metric. However, because you are dealing with time series data, the train/test split must respect the time order, as yours already does.

    • 2017 data looked good, but how do I implement it on 2018 data? How do I get the Train/Test sets? Do I even need it?

    Do you actually have a csv with the 2018 data? To split it into train/test, do the same as you did for the 2017 data, i.e. keep everything up to some size as train and leave the end to test your predictions: train, test = X[0:size], X[size:len(X)]. However, if what you want is a prediction from today's date onwards, why not use all the historical data as input to the model and forecast from that?

    What I think I need to do

    • Split it by 2017 and 2018

    Why would you want to split it? Simply feed your ARIMA model all your data as a single time series, i.e. append the two years of data, and hold out the last portion as test. Keep in mind that the estimates generally improve as the sample size grows. Once you've validated the performance of the model, use it to predict from today onwards.