I want to predict a time series from a model I estimate with tslm
from the forecast
package. Here is some data:
x <- ts(rnorm(120,0,3) + 1:120 + 20*sin(2*pi*(1:120)/12), frequency=12, start= c(2000, 01, 01))
y <- ts(x + rnorm(length(x)), frequency=12, start= c(2000, 01, 01))
df <- data.frame(y, x)
So we have an (independent) variable x
with some pattern and a (dependent) variable y
which appears to be a noisy version of x. I fit the model like this:
fit <- tslm(y ~ trend + season + x, df)
summary(fit)
looks okay, since x
is highly significant and estimate is close to 1. But running forecast(fit, h=20)
gives me an error:
... variable lengths differ (found for 'x') ...
forecast(fit, h= length(x))
works (although plot(forecast(fit, h= length(x)))
looks very strange, but this is an other question).
To forecast y
into the future using predictors like x
, trend
and seasonal
,
new data for the predictors must be supplied for the number of periods ahead you want to forecast.
This can be done using the argument newdata
in forecast.lm
(see ?forecast.lm
)
Under follows an example with only x as predictor where we want to forecast y
for the next 12 months
library(forecast)
n <- 132
set.seed(1337)
x <- ts(rnorm(n,0,3) + 1:n + 20*sin(2*pi*(1:n)/12), frequency=12, start= c(2000, 01, 01))
#Dividing x into a train and test set, where the test set will be the data we use to forecast ´y´
xtrain <- window(x, end=c(2009, 12))
xtest <- data.frame(x=window(x, start=c(2010, 1)))
y <- window(ts(x + rnorm(length(x)), frequency=12, start= c(2000, 01, 01)), end=c(2009,12))
dftrain <- data.frame(y, x=xtrain)
fit <- tslm(y ~ x, dftrain)
f <- forecast(fit, newdata=xtest)
plot(f)
What makes the tslm
function a bit 'special' is that it generates data for trend and seasonality automatically if this is specified, e.g.
fit2 <- tslm(y~trend+season)
f2 <- forecast(fit2, h=12)
plot(f2)
Here it automatically generates data for the newdata
argument, which can be found here:
f2$newdata #Beware, season is a factor: str(f2$newdata)
If we combine trend, season and x
, we get
fit3 <- tslm(y~trend+season+x, data=dftrain)
f3 <- forecast(fit3, newdata=xtest)
f3$newdata
Strange! Even though we expect it to use all predictors for the forecast, trend and season is not included in f$newdata
. We can try to include trend and seasonal manually and check if we get the same results:
#Using `seasonaldummy` from the `forecast` package to generate the seasonal dummy matrix.
#Beware: `forecast::seasonaldummy` use December as reference period by default, while `tslm` use January.
#This should not affect our results, except for the interpretation of the seasonal coefficients.
dftrain2 <- data.frame(y, x=xtrain, customTrend=1:(n-12), forecast::seasonaldummy(xtrain))
dftest2 <- data.frame(x=xtest, customTrend = (n-12+1):n, forecast::seasonaldummy(xtrain, h=12))
fit4 <- tslm(y~customTrend+Jan+Feb+Mar+Apr+May+Jun+Jul+Aug+Sep+Oct+Nov+x, data=dftrain2)
f4 <- forecast(fit4, newdata = dftest2)
f4$newdata #now everything is included.
#Compare the forecasts generated by fit3 and fit4:
f3$mean - f4$mean #Close enough
all.equal(f3$mean, f4$mean) #Point forecast
all.equal(f3$lower, f4$lower) #PIs
all.equal(f3$upper, f4$upper) #PIs
We can also include the seasonal variable as factor
, which is a bit easier (but less intuitive in my opinion), and yields completely identical coefficient estimates as fit3
.
dftrain3 <- data.frame(y, x=xtrain, customTrend=1:(n-12), customSeason = rep(factor(1:12, levels=1:12), 10))
dftest3 <- data.frame(x=xtest, customTrend = (n-12+1):n, customSeason = factor(1:12, levels=1:12))
fit5 <- tslm(y~customTrend+customSeason+x, data=dftrain3)
all(coefficients(fit3) == coefficients(fit5))
f5 <- forecast(fit5, newdata = dftest3)
f5$newdata