
Alternative to For-Loop: How to use rolling window for boosting in R?


I'm looking for help optimizing my code to get rid of loops and increase computational speed. I am pretty new to the field and to R. I run component-wise gradient boosting regressions on a linear time series model with a rolling window. For each window I use the coefficients from the regression of y on X to predict the next, out-of-window observation of y. (Later I will evaluate forecast accuracy.)

My data are 1560 different time series (including lags of the original series) with about 540 observations each (a data frame of dimension 540x1560).

I looked into rollapply but couldn't get it to work; in particular, I don't know how to predict yhat for each window (each iteration).

## Define window size
w <- 100

## Initialize list for the predictions (missing in the original snippet;
## assigning into ls_yhat[[i]] errors if the list does not exist yet)
ls_yhat <- list()

## Roll the window forward by one observation per iteration,
## predicting y_hat from the "pseudo" most recent in-window observation
for (i in 1:(nrow(df_all) - w)) {
  glm1 <- glmboost(fm, data = df_all[i:(w - 1 + i), ], center = TRUE,
                   control = boost_control(mstop = 100, trace = TRUE))
  ls_yhat[[i]] <- predict(glm1, newdata = df_all[w - 1 + i, ])
}

Any tips appreciated (takes forever to run on my laptop)!

PS: I am also looking into the multicore or parallel packages, especially because I'll use cross-validation for the stopping criterion later on. I've only just started looking into it, but any tips on that are appreciated too!

Edit: Minimal working example using built-in data (not a time series, though):

library("mboost") ## load package
data("bodyfat", package = "TH.data") ## load data

## Initialize list for the predictions
ls_yhat <- list()

## Define window size
w <- 30

## Roll the window forward by one observation per iteration,
## predicting y_hat from the "pseudo" most recent in-window observation
for (i in 1:(nrow(bodyfat) - w)) {
  glm1 <- glmboost(DEXfat ~ ., data = bodyfat[i:(w - 1 + i), ], center = TRUE,
                   control = boost_control(mstop = 15, trace = TRUE))
  ls_yhat[[i]] <- predict(glm1, newdata = bodyfat[w - 1 + i, ])
}
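For reference, `rollapply` from the zoo package can replicate this loop by sliding over the row *indices* rather than the data frame itself. This is only a hedged sketch against the same bodyfat example (it mirrors the loop's indexing, predicting at the last in-window row; it does not by itself make anything faster):

```r
library("zoo")      ## rollapply
library("mboost")
data("bodyfat", package = "TH.data")

w <- 30

## Slide a window of width w over the row indices; each window idx is a
## vector of w consecutive row numbers. Fit on those rows, predict at the
## window's last row, exactly as in the for-loop above.
yhat <- rollapply(seq_len(nrow(bodyfat) - 1), width = w,
                  FUN = function(idx) {
                    fit <- glmboost(DEXfat ~ ., data = bodyfat[idx, ],
                                    center = TRUE,
                                    control = boost_control(mstop = 15))
                    predict(fit, newdata = bodyfat[idx[w], ])
                  })
```

This produces one prediction per window, the same `nrow(bodyfat) - w` values the loop collects in `ls_yhat`.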

Solution

  • As Vlo rightly mentioned, the bottleneck is the boosting algorithm. Using the foreach and doParallel packages more than halved the running time, so I wanted to share my solution.

    library("mboost") ## load package
    data("bodyfat", package = "TH.data") ## load data
    library("foreach")
    library("doParallel")
    
    ##Register backend for parallel execution
    registerDoParallel()
    
    ## Define window size
    w <- 30

    ## Roll the window forward by one observation per iteration.
    ## foreach collects each iteration's return value into a list, so the
    ## loop body just returns the prediction; assigning into ls_yhat[[i]]
    ## inside %dopar% is unnecessary (each worker only sees its own copy).
    ls_yhat <- foreach(i = 1:(nrow(bodyfat) - w), .packages = "mboost") %dopar% {
      glm1 <- glmboost(DEXfat ~ ., data = bodyfat[i:(w - 1 + i), ], center = TRUE,
                       control = boost_control(mstop = 15, trace = TRUE))
      predict(glm1, newdata = bodyfat[w - 1 + i, ])
    }
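On the PS about cross-validating the stopping criterion: mboost ships `cvrisk()`, which evaluates the empirical risk over a grid of mstop values and can parallelize over folds through its `papply` argument. A sketch for a single window (not from the original post; shown on the first 30 bodyfat rows for illustration):

```r
library("mboost")
library("parallel")   ## mclapply (forked workers; sequential on Windows)
data("bodyfat", package = "TH.data")

## Fit one window with a generous mstop, then cross-validate it
glm1 <- glmboost(DEXfat ~ ., data = bodyfat[1:30, ], center = TRUE,
                 control = boost_control(mstop = 100))

## k-fold CV over the boosting iterations, folds evaluated in parallel
cvm <- cvrisk(glm1,
              folds = cv(model.weights(glm1), type = "kfold"),
              papply = mclapply)

## Set the model to the CV-optimal number of boosting iterations
mstop(glm1) <- mstop(cvm)
```

Inside the `foreach` loop this would replace the fixed `mstop = 15`, at the cost of running a full CV per window.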