python time-series random-forest forecasting

Forecasting future occurrences with Random Forest


I'm currently exploring the use of Random Forests to predict future numbers of occurrences (my ARIMA model gave me really bad forecasts, so I'm evaluating other options). I'm fully aware that the bad results might be because I don't have a lot of data and its quality isn't the greatest. My initial data consisted simply of the number of occurrences per date. I then added separate columns for the day, month, year and day of the week (the latter one-hot encoded), plus two columns with lagged values: one with the value observed the day before and another with the value observed two days before. The final data looks like this:

Count  Year    Month  Day   Count-1  Count-2  Friday  Monday  Saturday  Sunday  Thursday  Tuesday  Wednesday
196.0  2017.0  7.0    10.0  196.0    196.0    0       1       0         0       0         0        0
264.0  2017.0  7.0    11.0  196.0    196.0    0       0       0         0       0         1        0
274.0  2017.0  7.0    12.0  264.0    196.0    0       0       0         0       0         0        1
286.0  2017.0  7.0    13.0  274.0    264.0    0       0       0         0       1         0        0
502.0  2017.0  7.0    14.0  286.0    274.0    1       0       0         0       0         0        0
...    ...     ...    ...   ...      ...      ...     ...     ...       ...     ...       ...      ...
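
For readers who want to reproduce this layout, here is a rough sketch of how such columns could be built with pandas. It assumes the raw data is a DataFrame with a daily DatetimeIndex and a single Count column; the names build_features and DAYS are made up for this illustration. The sketch simply drops the first two rows, whose lags are unknown; how the original data filled them is not stated.

```python
import pandas as pd

# Weekday columns in the same (alphabetical) order as the table above.
DAYS = ["Friday", "Monday", "Saturday", "Sunday",
        "Thursday", "Tuesday", "Wednesday"]

def build_features(daily: pd.DataFrame) -> pd.DataFrame:
    """Build calendar, lag and one-hot weekday columns.

    `daily` is assumed to have a DatetimeIndex (one row per day)
    and a numeric 'Count' column.
    """
    out = daily[["Count"]].copy()

    # Calendar features taken from the date index.
    out["Year"] = out.index.year
    out["Month"] = out.index.month
    out["Day"] = out.index.day

    # Lagged values: yesterday's count and the count two days ago.
    out["Count-1"] = out["Count"].shift(1)
    out["Count-2"] = out["Count"].shift(2)

    # One-hot encode the day of the week, always emitting all 7 columns.
    day_names = out.index.day_name()
    for day in DAYS:
        out[day] = (day_names == day).astype(int)

    # The first two rows have no complete lag history, so drop them.
    return out.dropna()
```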

I then trained a random forest with the count as the label (what I'm trying to predict) and all the other columns as the features. I also made a 70/30 train/test split, trained the model on the training data and then used the test set to evaluate it (code below):

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=1000, random_state=42)
rf.fit(train_features, train_labels)

# Predict counts for the held-out test split
predictions = rf.predict(test_features)

The results I obtained were pretty good: MAE=1.71 and Accuracy of 89.84%.

First question: is there any possibility that I'm crazily overfitting the data? I just want to make sure I'm not making some big mistake that's giving me better results than I should get.

Second question: with the model trained, how do I use the RF to predict future values? My goal is to give weekly forecasts of the number of occurrences, but I'm kind of stuck on how to do that.

If someone who's a bit better and more experienced than me at this could help, I'd very much appreciate it! Thanks


Solution

  • Addressing your first question: a random forest can overfit, and the way to check is to compare the MAE, MSE or RMSE on your training set with the same metric on your test set; a test error far above the training error points to overfitting. What do you mean by accuracy? Your R squared? A common way to work with models is to let them overfit at first, so that you get a decent accuracy/MSE/RMSE, and then apply regularization to deal with the overfitting, for example by setting a low max_depth or a high min_samples_leaf (the closest scikit-learn counterpart of XGBoost's min_child_weight, which RandomForestRegressor does not have). A high n_estimators is also good.
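
One way to run the overfitting check described above is sketched below. It is only an illustration: train_test_mae is a made-up helper, and it assumes the rows are sorted by date so that the last 30% can serve as a chronological test set (the question does not say how the split was made). Extra keyword arguments such as max_depth or min_samples_leaf are passed straight to RandomForestRegressor.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

def train_test_mae(features, labels, test_fraction=0.3, **rf_params):
    """Fit on the earliest rows, evaluate on the latest rows.

    Returns (train_mae, test_mae). Assumes rows are in date order.
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels, dtype=float)

    # Chronological split: no shuffling, the test set is the most recent data.
    split = int(len(labels) * (1 - test_fraction))
    x_train, x_test = features[:split], features[split:]
    y_train, y_test = labels[:split], labels[split:]

    rf = RandomForestRegressor(random_state=42, **rf_params)
    rf.fit(x_train, y_train)

    # Same metric on both splits, so the two numbers are comparable.
    train_mae = mean_absolute_error(y_train, rf.predict(x_train))
    test_mae = mean_absolute_error(y_test, rf.predict(x_test))
    return train_mae, test_mae
```

Calling it once with the default settings and once with, say, max_depth=5 and min_samples_leaf=10 shows whether regularization narrows the gap between the two errors.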

    Secondly, to predict future values you need to use the exact same model you trained, on the dataset you want predictions for. The features given at training time must of course match the inputs given when forecasting. Also keep in mind that as time passes, newly observed data becomes very valuable: adding it to your training set and retraining will improve the model.

    forecasting = rf.predict(dataset_to_be_forecasted)
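
Because the features in the question include Count-1 and Count-2, the rows for a whole future week cannot all be built up front: the lags for day 2 depend on the prediction for day 1. A possible approach, sketched here under assumptions, is to forecast one day at a time and feed each prediction back in as a lag. The helper forecast_next_days and the DAYS list are hypothetical names, and the column order is assumed to match the table in the question (without Count).

```python
import pandas as pd

# Weekday columns in the same order as the training table.
DAYS = ["Friday", "Monday", "Saturday", "Sunday",
        "Thursday", "Tuesday", "Wednesday"]

def forecast_next_days(model, history, last_date, horizon=7):
    """Recursive multi-step forecast.

    model     : any fitted object with a .predict(X) method
    history   : observed counts in date order, ending at last_date
                (at least two values are needed for the lags)
    last_date : date of the last observed count
    horizon   : number of future days to forecast
    """
    counts = [float(c) for c in history[-2:]]
    dates, preds = [], []

    for step in range(1, horizon + 1):
        date = pd.Timestamp(last_date) + pd.Timedelta(days=step)

        # Build one feature row, in the same column order used for training.
        row = {"Year": date.year, "Month": date.month, "Day": date.day,
               "Count-1": counts[-1], "Count-2": counts[-2]}
        for day in DAYS:
            row[day] = int(date.day_name() == day)

        pred = float(model.predict(pd.DataFrame([row]))[0])

        # The prediction becomes the newest lag value for the next step.
        counts.append(pred)
        dates.append(date)
        preds.append(pred)

    return pd.Series(preds, index=pd.DatetimeIndex(dates), name="forecast")
```

Errors accumulate with this scheme, since later days are predicted from earlier predictions rather than from observed counts, so a one-week horizon is likely to be less accurate than the one-step test MAE suggests.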