r time-series prediction forecasting forecast

Can not choose number of periods for forecasting any other than variable length

I want to predict a time series from a model I estimate with tslm from the forecast package. Here is some data:

x <- ts(rnorm(120,0,3) + 1:120 + 20*sin(2*pi*(1:120)/12), frequency=12, start= c(2000, 01, 01))
y <- ts(x + rnorm(length(x)), frequency=12, start= c(2000, 01, 01))
df <- data.frame(y, x)

So we have an (independent) variable x with some pattern and a (dependent) variable y which appears to be a noisy version of x. I fit the model like this:

fit <- tslm(y ~ trend + season + x, df)

summary(fit) looks okay, since x is highly significant and estimate is close to 1. But running forecast(fit, h=20) gives me an error:

... variable lengths differ (found for 'x') ...

forecast(fit, h= length(x)) works (although plot(forecast(fit, h= length(x))) looks very strange, but this is an other question).

Solution

To forecast y into the future using predictors like x, trend and seasonal, new data for the predictors must be supplied for the number of periods ahead you want to forecast. This can be done using the argument newdata in forecast.lm (see ?forecast.lm)

Under follows an example with only x as predictor where we want to forecast y for the next 12 months

library(forecast)
n <- 132 
set.seed(1337)
x <- ts(rnorm(n,0,3) + 1:n + 20*sin(2*pi*(1:n)/12), frequency=12, start= c(2000, 01, 01))
#Dividing x into a train and test set, where the test set will be the data we use to forecast ´y´
xtrain <- window(x, end=c(2009, 12))
xtest <- data.frame(x=window(x, start=c(2010, 1)))
y <- window(ts(x + rnorm(length(x)), frequency=12, start= c(2000, 01, 01)), end=c(2009,12))
dftrain <- data.frame(y, x=xtrain)

fit <- tslm(y ~ x, dftrain)
f <- forecast(fit, newdata=xtest)
plot(f)

What makes the tslm function a bit 'special' is that it generates data for trend and seasonality automatically if this is specified, e.g.

fit2 <- tslm(y~trend+season)
f2 <- forecast(fit2, h=12)
plot(f2)

Here it automatically generates data for the newdata argument, which can be found here:

f2$newdata #Beware, season is a factor: str(f2$newdata)

If we combine trend, season and x, we get

fit3 <- tslm(y~trend+season+x, data=dftrain)
f3 <- forecast(fit3, newdata=xtest)
f3$newdata

Strange! Even though we expect it to use all predictors for the forecast, trend and season is not included in f$newdata. We can try to include trend and seasonal manually and check if we get the same results:

#Using `seasonaldummy` from the `forecast` package to generate the seasonal dummy matrix. 
#Beware: `forecast::seasonaldummy` use December as reference period by default, while `tslm` use January.
#This should not affect our results, except for the interpretation of the seasonal coefficients.
dftrain2 <- data.frame(y, x=xtrain, customTrend=1:(n-12), forecast::seasonaldummy(xtrain))
dftest2 <- data.frame(x=xtest, customTrend = (n-12+1):n, forecast::seasonaldummy(xtrain, h=12))
fit4 <- tslm(y~customTrend+Jan+Feb+Mar+Apr+May+Jun+Jul+Aug+Sep+Oct+Nov+x, data=dftrain2)
f4 <- forecast(fit4, newdata = dftest2)

f4$newdata #now everything is included. 

#Compare the forecasts generated by fit3 and fit4:
f3$mean - f4$mean #Close enough
all.equal(f3$mean, f4$mean) #Point forecast
all.equal(f3$lower, f4$lower) #PIs
all.equal(f3$upper, f4$upper) #PIs

We can also include the seasonal variable as factor, which is a bit easier (but less intuitive in my opinion), and yields completely identical coefficient estimates as fit3.

dftrain3 <- data.frame(y, x=xtrain, customTrend=1:(n-12), customSeason = rep(factor(1:12, levels=1:12), 10))
dftest3 <- data.frame(x=xtest, customTrend = (n-12+1):n, customSeason = factor(1:12, levels=1:12))
fit5 <- tslm(y~customTrend+customSeason+x, data=dftrain3)
all(coefficients(fit3) == coefficients(fit5))
f5 <- forecast(fit5, newdata = dftest3)
f5$newdata