Tags: r, time-series, tidyverse, forecasting

Cross-validation in time series is returning errors, but the non-cross-validation method runs without errors and with higher accuracy


I'm working to forecast the number of employees in the United States by month. The data is loaded as follows (the source URL is in the code):

library(tidyverse)
library(fpp3)

# Source: https://beta.bls.gov/dataViewer/view/timeseries/CES0000000001
All_Employees <- read_csv('https://raw.githubusercontent.com/InfiniteCuriosity/predicting_labor/main/All_Employees.csv', col_select = c(Label, Value), show_col_types = FALSE)

# Rename the columns, convert the month labels to yearmonth, and index the tsibble on Month
All_Employees <- All_Employees %>%
  rename(Month = Label, Total_Employees = Value) %>%
  mutate(Month = yearmonth(Month)) %>%
  as_tsibble(index = Month)
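
As a quick sanity check (a minimal sketch, assuming the CSV parses as above), you can confirm that the tsibble has a regular, gap-free monthly index before modelling:

# has_gaps() reports whether the monthly index is missing any periods
All_Employees %>% has_gaps()

# quick visual inspection of the series
All_Employees %>% autoplot(Total_Employees)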

I'm following the excellent text Forecasting: Principles and Practice, 3rd Edition, specifically the page that discusses time series cross-validation.

Here is the code I'm running using cross-validation:

All_Employees_train <- All_Employees %>% 
  stretch_tsibble()

All_Employees_train %>% 
  model(
    linear = TSLM(Total_Employees ~ trend() + season()),
    Exponential = TSLM(log(Total_Employees) ~ trend() + season()),
    Arima = ARIMA(Total_Employees ~ trend() + season()),
    Ets = ETS(Total_Employees),
    Mean = MEAN(Total_Employees),
    Naive = NAIVE(Total_Employees),
    SNaive = SNAIVE(Total_Employees),
    Drift = SNAIVE(Total_Employees ~ drift())) %>%
  forecast(h = 3) %>% 
  accuracy(All_Employees) %>% 
  arrange(RMSE)

That code returns the result below, plus more than 50 warnings (several of which wrap actual model-fitting errors):

# A tibble: 8 × 10
  .model      .type    ME   RMSE   MAE    MPE  MAPE  MASE RMSSE  ACF1
  <chr>       <chr> <dbl>  <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
1 Naive       Test   162.  2168.  747.  0.101 0.541 0.226 0.487 0.685
2 Ets         Test   214.  4227.  774.  0.145 0.563 0.234 0.949 0.608
3 SNaive      Test   806.  4453. 3303.  0.515 2.36  1     1.00  0.866
4 Drift       Test   535.  4469. 3170.  0.343 2.28  0.960 1.00  0.868
5 Exponential Test  1861.  4692. 3942.  1.27  2.81  1.19  1.05  0.934
6 linear      Test  1887.  4697. 3952.  1.29  2.81  1.20  1.05  0.934
7 Mean        Test  3565.  6724. 5410.  2.37  3.77  1.64  1.51  0.959
8 Arima       Test  -488. 11113. 2290. -0.383 1.65  0.693 2.50  0.673
There were 50 or more warnings (use warnings() to see the first 50)

Here are a few of the 50+ warnings:

Warning messages:
1: In for (i in namD) if (is.character(data[[i]])) data[[i]] <- factor(data[[i]]) :
  closing unused connection 12 (<-localhost:11913)

11: Provided exogenous regressors are rank deficient, removing regressors: `season()year2`, `season()year3`, `season()year4`, `season()year5`, `season()year6`, `season()year7`, `season()year8`, `season()year9`, `season()year10`, `season()year11`, `season()year12`

24: In sqrt(diag(best$var.coef)) : NaNs produced

27: 12 errors (2 unique) encountered for Arima

28: 3 errors (2 unique) encountered for Ets
[2] Not enough data to estimate this ETS model.
[1] only 1 case, but 2 variables

50: Problem while computing `Exponential = (function (object, ...) ...`.
ℹ prediction from a rank-deficient fit may be misleading
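
(A hypothetical illustration, not part of the original run: the rank-deficiency warning can be reproduced in isolation by fitting the seasonal TSLM on a slice shorter than one year.)

# With only 6 months of data, the 11 seasonal dummies produced by season()
# cannot all be estimated, so the underlying lm() fit is rank deficient,
# and forecasting from it should raise the same warning as above.
All_Employees %>%
  slice(1:6) %>%
  model(linear = TSLM(Total_Employees ~ trend() + season())) %>%
  forecast(h = 3)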

However, if I simply make a training set and run the exact same code against it, no errors are returned, the best models have a much lower RMSE than under cross-validation, and the results come back much faster (for obvious reasons). Here is the code to make the training set, and the results:

All_Employees_train <- All_Employees %>% 
  filter(Month <= yearmonth("2022 Feb"))
# A tibble: 8 × 10
  .model      .type     ME   RMSE    MAE   MPE  MAPE  MASE RMSSE      ACF1
  <chr>       <chr>  <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>     <dbl>
1 Naive       Test    819.   885.   819. 0.541 0.541 0.252 0.201 -0.000688
2 Ets         Test    825.   891.   825. 0.545 0.545 0.254 0.202 -0.000688
3 Arima       Test   1656.  1861.  1656. 1.09  1.09  0.509 0.422 -0.120   
4 Exponential Test   3075.  3178.  3075. 2.03  2.03  0.946 0.720 -0.151   
5 linear      Test   3172.  3265.  3172. 2.10  2.10  0.976 0.740 -0.143   
6 Drift       Test   5810.  5810.  5810. 3.84  3.84  1.79  1.32  -0.378   
7 SNaive      Test   6521.  6522.  6521. 4.31  4.31  2.01  1.48  -0.378   
8 Mean        Test  11457. 11462. 11457. 7.57  7.57  3.53  2.60  -0.000688
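
(For completeness, the run above presumably reuses the exact pipeline from the cross-validation attempt, i.e. something like this sketch; two of the eight models are shown for brevity:)

All_Employees_train %>%
  model(
    Naive = NAIVE(Total_Employees),
    Ets = ETS(Total_Employees)
    # ... the remaining six models, exactly as in the cross-validation block
  ) %>%
  forecast(h = 3) %>%
  accuracy(All_Employees) %>%
  arrange(RMSE)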

How can the cross-validation method be run without errors (and hopefully with better results)?


Solution

  • Your stretched data set contains very short time series, and fitting models to them is what causes these warnings. When you use stretch_tsibble(), set .init to a larger number; it controls the length of the shortest training series. For example, use at least two years of data in each training set:

    All_Employees_train <- All_Employees %>% 
      stretch_tsibble(.init = 24)
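
  • To see why the default stretch fails: the first folds contain only 1, 2, 3, ... observations, which is far too little to estimate seasonal TSLM, ARIMA, or ETS models (hence messages like "only 1 case, but 2 variables"). A minimal sketch of the diagnosis and of the fix follows; the .step argument is optional and only reduces the number of folds, and therefore the runtime:

    # Fold sizes under the default stretch: 1, 2, 3, ...
    All_Employees %>%
      stretch_tsibble() %>%
      as_tibble() %>%
      count(.id) %>%
      head(3)

    # With .init = 24, every fold has at least two years of data;
    # .step = 3 grows the window three months at a time, for roughly
    # a threefold reduction in folds (and runtime).
    All_Employees %>%
      stretch_tsibble(.init = 24, .step = 3) %>%
      model(Ets = ETS(Total_Employees)) %>%  # the full eight-model block from above works the same way
      forecast(h = 3) %>%
      accuracy(All_Employees) %>%
      arrange(RMSE)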