
Model interpretation using timeslice method in CARET


Suppose you want to evaluate a simple glm model to forecast an economic data series. Consider the following code:

library(caret)
library(ggplot2)  # provides the economics data set
data(economics)
h <- 7
# Rolling-origin resampling: train on a fixed window of 24*h = 168 rows,
# then predict the next 12 rows.
myTimeControl <- trainControl(method = "timeslice",
                              initialWindow = 24*h,
                              horizon = 12,
                              fixedWindow = TRUE)

fit.glm <- train(unemploy ~ pce + pop + psavert,
                 data = economics,
                 method = "glm",
                 preProc = c("center", "scale", "BoxCox"),
                 trControl = myTimeControl)

Suppose that the covariates used in the train formula are themselves predictions obtained from some other model. This simple model gives the following results:

Generalized Linear Model 

574 samples
3 predictor

Pre-processing: centered (3), scaled (3), Box-Cox transformation (3) 
Resampling: Rolling Forecasting Origin Resampling (12 held-out with a fixed window)
Summary of sample sizes: 168, 168, 168, 168, 168, 168, ... 
Resampling results:

RMSE      Rsquared 
1446.335  0.2958317
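The "Summary of sample sizes" line above can be checked by generating the resampling indices directly. The following is a minimal sketch, assuming that `train` builds its slices the same way as caret's `createTimeSlices` with the same arguments; the variable name `slices` is only illustrative. With the default `skip = 0`, the forecasting origin moves forward one row at a time, so for the 574 rows reported above there should be 574 - 168 - 12 + 1 = 395 resamples.

```r
library(caret)
library(ggplot2)  # only needed for the economics data set
data(economics)

h <- 7
# Same settings as in trainControl; skip is left at its default of 0,
# so consecutive slices are shifted by a single row.
slices <- createTimeSlices(economics$unemploy,
                           initialWindow = 24 * h,
                           horizon = 12,
                           fixedWindow = TRUE)

length(slices$train)           # number of resamples
unique(lengths(slices$train))  # size of every training window
unique(lengths(slices$test))   # size of every hold-out set
slices$test[[1]]               # first hold-out set: rows 169 to 180
```

Passing a non-zero `skip` to `trainControl` (for example `skip = 11`) would space the origins further apart, which is closer to "retraining after every 12 samples".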

Leaving aside the poor results (this is only an example), I have two questions:

  1. Is it correct to read the above results as those obtained over the entire dataset by a GLM trained on only 24*h = 24*7 = 168 samples and retrained after every horizon = 12 samples?
  2. How can I evaluate the RMSE as the horizon grows from 1 to 12 (as reported here: http://robjhyndman.com/hyndsight/tscvexample/)?

If I print the summary of fit.glm, I obtain:

Call:
NULL

Deviance Residuals: 
  Min       1Q   Median       3Q      Max  
-5090.0  -1025.5   -208.1    833.4   4948.4  

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  7771.56      64.93 119.688  < 2e-16 ***
pce          5750.27    1153.03   4.987 8.15e-07 ***
pop         -1483.01    1117.06  -1.328    0.185    
psavert      2932.38     144.56  20.286  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 2420081)

    Null deviance: 3999514594  on 573  degrees of freedom
Residual deviance: 1379446256  on 570  degrees of freedom
AIC: 10072

Number of Fisher Scoring iterations: 2 

Do the parameters shown refer to the last trained GLM, or are they "average" parameters? I hope I've been clear enough.


Solution

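  • This resampling method works like any other. The RMSE is estimated using different subsets of the training data; note that the output says "Summary of sample sizes: 168, 168, 168, 168, 168, 168, ...". The final model, however, is fit on the entire training set, so the coefficients in the summary come from that single fit on all 574 samples (hence the 573 null degrees of freedom). They are neither the coefficients of the last slice nor an average over slices.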

    The difference between Rob's results and these is primarily due to the difference in metric: his example reports Mean Absolute Error (MAE), while the output here reports Root Mean Squared Error (RMSE).
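To look at the error as the horizon grows from 1 to 12, one option is to keep the hold-out predictions and group them by their position within each slice. This is a sketch rather than a built-in caret report: it assumes `savePredictions = "final"` in `trainControl`, and the names `ctrl`, `fit`, `pred`, `step`, and `by_step` are illustrative. It also checks that the final model was fit on every row.

```r
library(caret)
library(ggplot2)  # only needed for the economics data set
data(economics)

ctrl <- trainControl(method = "timeslice",
                     initialWindow = 24 * 7,
                     horizon = 12,
                     fixedWindow = TRUE,
                     savePredictions = "final")  # keep hold-out predictions

fit <- train(unemploy ~ pce + pop + psavert,
             data = economics,
             method = "glm",
             preProc = c("center", "scale", "BoxCox"),
             trControl = ctrl)

# The final model is refit on the whole data set, not on one slice.
nobs(fit$finalModel) == nrow(economics)

pred <- fit$pred
# Steps ahead = position of each held-out row within its own slice
# (hold-out rows in a slice are consecutive, so this gives 1..12).
pred$step <- ave(pred$rowIndex, pred$Resample,
                 FUN = function(i) i - min(i) + 1L)
err <- pred$pred - pred$obs

# Error by forecast step, pooled over all slices.
by_step <- data.frame(
  step = 1:12,
  RMSE = as.numeric(sqrt(tapply(err^2, pred$step, mean))),
  MAE  = as.numeric(tapply(abs(err), pred$step, mean))
)
by_step
```

The `MAE` column is the one comparable to Rob's plot. RMSE is never smaller than MAE on the same errors, so the two columns should not be expected to match.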