I'm looking for help optimizing my code to get rid of loops and increase computational speed. I am pretty new to the field and to R. I run component-wise gradient boosting regressions on a linear time series model with a rolling window, and I use the coefficients from the regression of y on X in each window to predict the next "out of window" observation of y. (Later I will evaluate forecast accuracy.)
My data are 1560 different time series (including lags of the original series) with about 540 observations each (a data frame of dimension 540x1560).
I looked into rollapply but couldn't get it to work. In particular, I don't know how to predict yhat for each window (each iteration).
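For reference, the pattern I was trying to express with rollapply looks roughly like this sketch: roll over the row indices instead of the data itself, fit within each window, and predict the first row after it (fm and df_all are my formula and 540x1560 data frame; untested, so treat the details as assumptions). Note that rollapply still fits one model per window, so it tidies the code but won't by itself speed anything up.

library("zoo")
library("mboost")

w <- 100
## Each call to FUN receives w consecutive row indices; align = "left"
## makes window i cover rows i:(w-1+i), so the forecast target is max(idx) + 1
yhat_roll <- rollapply(seq_len(nrow(df_all) - 1), width = w, align = "left",
                       FUN = function(idx) {
                         fit <- glmboost(fm, data = df_all[idx, ], center = TRUE,
                                         control = boost_control(mstop = 100))
                         as.numeric(predict(fit, newdata = df_all[max(idx) + 1, ]))
                       })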
library("mboost")  ## glmboost(); fm is the model formula, df_all the 540x1560 data frame

## Initialize list for the predictions
ls_yhat <- list()

## Define window size
w <- 100

## Roll the window by one observation per iteration and predict the next
## (first out-of-window) dependent variable y_hat(w+i)
for (i in 1:(nrow(df_all) - w)) {
  glm1 <- glmboost(fm, data = df_all[i:(w - 1 + i), ], center = TRUE,
                   control = boost_control(mstop = 100, trace = TRUE))
  ls_yhat[[i]] <- predict(glm1, newdata = df_all[w + i, ])
}
Any tips are appreciated (it takes forever to run on my laptop)!
PS: I am also looking into using the multicore or parallel packages, especially because I'll use cross-validation for the stopping criterion later on. I have only just started looking into them, but any tips on that are appreciated too!
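For the cross-validated stopping criterion specifically, mboost's own cvrisk() can run the folds in parallel via its papply argument, which may be the most natural entry point. A minimal sketch, assuming glm1 is one fitted model from the loop above (B = 25 bootstrap folds is just an illustrative choice):

library("mboost")
## Cross-validate the number of boosting iterations for one window's model;
## papply = parallel::mclapply evaluates the folds in parallel (Unix-alikes)
cvr <- cvrisk(glm1,
              folds = cv(model.weights(glm1), type = "bootstrap", B = 25),
              papply = parallel::mclapply)
mstop(glm1) <- mstop(cvr)  ## set the model to the CV-optimal stopping iteration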
Edit: Minimal working example using built-in data (not a time series, though):
library("mboost") ## load package
data("bodyfat", package = "TH.data") ## load data
##Initializing List for coefficients DFs
ls_yhat <- list()
#Define windows size
w=30
##Starting Loop, rolling the window by one observation per iteration
##Predicting the next dependent variable y_hat(w+i) with the data from the "pseudo" most recent observation
for (i in 1:(nrow(bodyfat)-w)){
glm1 <- glmboost(DEXfat~., data=bodyfat[i:(w-1+i), ], center=TRUE, control=boost_control(mstop = 15, trace=TRUE))
ls_yhat[[i]] <- predict(glm1, newdata = bodyfat[(w-1+i),])
i
}
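Since the end goal is evaluating forecast accuracy, the collected predictions can be compared against the realized values directly; a minimal sketch, given the indexing above (prediction i targets row w + i):

yhat   <- unlist(ls_yhat)                       ## one prediction per window
actual <- bodyfat$DEXfat[(w + 1):nrow(bodyfat)] ## realized target values
mean((yhat - actual)^2)                         ## out-of-sample mean squared error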
As Vlo rightly mentioned, the bottleneck is the boosting algorithm. I used the foreach and doParallel packages, which more than halved the running time. I wanted to share my solution.
library("mboost") ## load package
data("bodyfat", package = "TH.data") ## load data
library("foreach")
library("doParallel")
##Register backend for parallel execution
registerDoParallel()
##Initializing List for coefficients DFs
ls_yhat <- list()
#Define windows size
w=30
##Starting Loop, rolling the window by one observation per iteration
##Predicting the next dependent variable y_hat(w+i) with the data from the "pseudo" most recent observation
ls_yhat <- foreach (i = 1:(nrow(bodyfat)-w), .packages='mboost') %dopar%{
glm1 <- glmboost(DEXfat~., data=bodyfat[i:(w-1+i), ], center=TRUE, control=boost_control(mstop = 15, trace=TRUE))
ls_yhat[[i]] <- predict(glm1, newdata = bodyfat[(w-1+i),])
}
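One note on the backend: registerDoParallel() with no arguments picks a default number of workers; for explicit control and a clean shutdown you can create the cluster yourself. A sketch, with the worker count as an assumption to adjust for your machine:

library("doParallel")
cl <- makeCluster(4)   ## e.g. 4 workers; adjust to your core count
registerDoParallel(cl)
## ... run the foreach loop above ...
stopCluster(cl)        ## release the workers when finished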