I'm trying to forecast the number of employees in the United States by month. The data is loaded as follows:
library(tidyverse)
library(fpp3)
# Source: https://beta.bls.gov/dataViewer/view/timeseries/CES0000000001
All_Employees <- read_csv('https://raw.githubusercontent.com/InfiniteCuriosity/predicting_labor/main/All_Employees.csv', col_select = c(Label, Value), show_col_types = FALSE)
All_Employees <- All_Employees %>%
  rename(Month = Label, Total_Employees = Value)
All_Employees <- All_Employees %>%
  mutate(Month = yearmonth(Month)) %>%
  as_tsibble(index = Month)
I'm following the excellent text Forecasting: Principles and Practice, 3rd Edition, in particular the page that discusses cross-validation.
Here is the code I'm running using cross-validation:
All_Employees_train <- All_Employees %>%
  stretch_tsibble()
All_Employees_train %>%
  model(
    linear = TSLM(Total_Employees ~ trend() + season()),
    Exponential = TSLM(log(Total_Employees) ~ trend() + season()),
    Arima = ARIMA(Total_Employees ~ trend() + season()),
    Ets = ETS(Total_Employees),
    Mean = MEAN(Total_Employees),
    Naive = NAIVE(Total_Employees),
    SNaive = SNAIVE(Total_Employees),
    Drift = SNAIVE(Total_Employees ~ drift())) %>%
  forecast(h = 3) %>%
  accuracy(All_Employees) %>%
  arrange(RMSE)
That code returns the results below, along with more than 50 warnings:
# A tibble: 8 × 10
.model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Naive Test 162. 2168. 747. 0.101 0.541 0.226 0.487 0.685
2 Ets Test 214. 4227. 774. 0.145 0.563 0.234 0.949 0.608
3 SNaive Test 806. 4453. 3303. 0.515 2.36 1 1.00 0.866
4 Drift Test 535. 4469. 3170. 0.343 2.28 0.960 1.00 0.868
5 Exponential Test 1861. 4692. 3942. 1.27 2.81 1.19 1.05 0.934
6 linear Test 1887. 4697. 3952. 1.29 2.81 1.20 1.05 0.934
7 Mean Test 3565. 6724. 5410. 2.37 3.77 1.64 1.51 0.959
8 Arima Test -488. 11113. 2290. -0.383 1.65 0.693 2.50 0.673
There were 50 or more warnings (use warnings() to see the first 50)
Here are a few of the 50+ warnings:
Warning messages:
1: In for (i in namD) if (is.character(data[[i]])) data[[i]] <- factor(data[[i]]) :
closing unused connection 12 (<-localhost:11913)
11: Provided exogenous regressors are rank deficient, removing regressors: `season()year2`, `season()year3`, `season()year4`, `season()year5`, `season()year6`, `season()year7`, `season()year8`, `season()year9`, `season()year10`, `season()year11`, `season()year12`
24: In sqrt(diag(best$var.coef)) : NaNs produced
27: 12 errors (2 unique) encountered for Arima
28: 3 errors (2 unique) encountered for Ets
[2] Not enough data to estimate this ETS model.
[1] only 1 case, but 2 variables
50: Problem while computing `Exponential = (function (object, ...) ...`.
ℹ prediction from a rank-deficient fit may be misleading
However, if I simply make a single training set and run the exact same code on it, no errors are returned, the best RMSE is much lower than with cross-validation, and the results come back much faster (for obvious reasons). Here is the code to make the training set, followed by the results:
All_Employees_train <- All_Employees %>%
  filter(Month <= yearmonth("2022 Feb"))
# A tibble: 8 × 10
.model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Naive Test 819. 885. 819. 0.541 0.541 0.252 0.201 -0.000688
2 Ets Test 825. 891. 825. 0.545 0.545 0.254 0.202 -0.000688
3 Arima Test 1656. 1861. 1656. 1.09 1.09 0.509 0.422 -0.120
4 Exponential Test 3075. 3178. 3075. 2.03 2.03 0.946 0.720 -0.151
5 linear Test 3172. 3265. 3172. 2.10 2.10 0.976 0.740 -0.143
6 Drift Test 5810. 5810. 5810. 3.84 3.84 1.79 1.32 -0.378
7 SNaive Test 6521. 6522. 6521. 4.31 4.31 2.01 1.48 -0.378
8 Mean Test 11457. 11462. 11457. 7.57 7.57 3.53 2.60 -0.000688
How can the cross-validation method be run without errors (and hopefully better results)?
Your stretched data set contains very short time series, and fitting models to them is causing these warnings. When you use stretch_tsibble(), set .init to a larger number; this controls the length of the smallest time series (the default is 1, so the first fold contains a single observation). For example, use at least 2 years of data in each of the training sets:
All_Employees_train <- All_Employees %>%
  stretch_tsibble(.init = 24)
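To check what .init changes, here is a minimal sketch on simulated data. The `toy` series, its numbers, and the reduced model list are made up for illustration and are not the BLS data; the same pipeline should apply to All_Employees. It confirms that the shortest fold has 24 observations and then runs the cross-validated accuracy comparison.

```r
library(dplyr)
library(tsibble)
library(fable)

# Hypothetical stand-in data: 60 months with a trend and a yearly pattern.
set.seed(1)
toy <- tsibble(
  Month = yearmonth("2015 Jan") + 0:59,
  Total_Employees = 140000 + 150 * (0:59) +
    500 * sin(2 * pi * (0:59) / 12) + rnorm(60, sd = 100),
  index = Month
)

# Hold back the last 3 months so every fold's 3-step forecast has actuals,
# then build expanding windows that start at 24 observations.
toy_cv <- toy %>%
  filter(Month <= max(Month) - 3) %>%
  stretch_tsibble(.init = 24, .step = 1)

# Number of observations in each fold; the smallest should be 24.
fold_sizes <- toy_cv %>%
  as_tibble() %>%
  count(.id)
range(fold_sizes$n)

# Fit on every fold, forecast 3 steps ahead, and score against the full series.
cv_accuracy <- toy_cv %>%
  model(
    linear = TSLM(Total_Employees ~ trend() + season()),
    Naive  = NAIVE(Total_Employees),
    SNaive = SNAIVE(Total_Employees)
  ) %>%
  forecast(h = 3) %>%
  accuracy(toy) %>%
  arrange(RMSE)
cv_accuracy
```

With two full years in the smallest fold, the seasonal dummies in TSLM and the seasonal lag in SNAIVE each have at least one complete cycle to work with. Holding back the final 3 months is a choice made for this sketch; without it, the last folds forecast past the end of the data and accuracy() treats those points as missing.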