Search code examples
rloopsiterationlinear-regressionlm

Linear model fitting iteratively and calculate the Variable Importance with varImp() for all predictors over the iterations


I would like to iteratively fit a linear model (= LM) and calculate the variable importance of the individual variables/predictors after each iteration with the caret::varImp() function. My data table and the number of days look like this (the number of columns of dt.train can always differ and the days variable also):

d <- 50  
## Create random data table: ##
dt.train <- data.table(date = seq(as.Date('2020-01-01'), by = '1 day', length.out = 366),
                       "DE" = rnorm(366, 35, 1), "Wind" = rnorm(366, 5000, 2), "Solar" = rnorm(366, 3, 2),
                       "Nuclear" = rnorm(366, 100, 5), "ResLoad" = rnorm(366, 200, 3),  check.names = FALSE)

I also already have a function that calculates this for me every d days (here d<-50):

varImportance <- function(data){
  ## Model fitting: ##
  lmModel <- stats::lm(DE ~ .-1-date, data = data)
  
  terms <- attr(lmModel$terms , "term.labels")
  varimp <- caret::varImp(lmModel)
  importance <- data[, .(date, imp = t(varimp))]
} 
## Get variable importance: ##
dt.importance <- dt.train[, varImportance(.SD), by = seq_len(nrow(dt.train)) %/% d]

Now I would like to have a loop that builds up the iteration for me as follows:

1. Run: Model fitting with the data from "2020-01-01" to "2020-02-20" (which are here exactly the 50 days)

2. Run: Model fitting with the data from "2020-01-02" to "2020-02-21"

3. Run: Model fitting with the data from "2020-01-03" to "2020-02-22"

... and so on

Last run: Model fitting with the data from "2020-11-11" to "2020-12-31"


The variable importance should then always be saved in a new table to match the model fitting, i.e.:

variable Importance of 1. Run: should be declared for "2020-01-01",

variable Importance of 2. Run: should be declared for "2020-01-02",

... and so on

variable Importance of Last Run: should be declared for "2020-11-11".

How could this work?


Solution

  • You could use rollapply from the zoo package.

    Important arguments are :

    • width to set the window
    • by.column = FALSE to pass all the columns together to the model
    • aligned = 'left' so that the roll window starts from the first data point on

    As rollapply works on matrices, it converts dates mixed with numeric to character, see, so that the date field had to be handled separately.

    library(data.table)
    library(caret)
    library(zoo)
    
    d <- 50
    ## Create random data table: ##
    dt.train <- data.table(date = seq(as.Date('2020-01-01'), by = '1 day', length.out = 366),
                           "DE" = rnorm(366, 35, 1), "Wind" = rnorm(366, 5000, 2), "Solar" = rnorm(366, 3, 2),
                           "Nuclear" = rnorm(366, 100, 5), "ResLoad" = rnorm(366, 200, 3),  check.names = FALSE)
    
    varImportance <- function(data){
      ## Model fitting: ##
      lmModel <- stats::lm(DE ~ .-1, data = data.table(data))
      terms <- attr(lmModel$terms , "term.labels")
      varimp <- caret::varImp(lmModel)
      importance <- t(varimp)
    }
    
    # Removing date because rollapply needs a unique type
    Importance <- as.data.frame(zoo::rollapply(dt.train[,!"date"], 
                                               FUN = varImportance,
                                               width = d,
                                               by.column=FALSE,
                                               align='left')
                                )
    
    # Adding back date
    Importance <- cbind(dt.train[1:nrow(Importance),.(date)],Importance)
    
    Importance
    #>            date     Wind     Solar   Nuclear    ResLoad
    #>   1: 2020-01-01 2.523219 1.0253985 0.1676970 0.80379590
    #>   2: 2020-01-02 2.535376 1.3231915 0.3292608 0.78803748
    #>   3: 2020-01-03 2.636790 1.5249620 0.4857825 0.85169700
    #>   4: 2020-01-04 3.158113 1.1318521 0.1869724 0.24190772
    #>   5: 2020-01-05 3.326954 1.0991870 0.2341736 0.09327451
    #>  ---                                                   
    #> 313: 2020-11-08 4.552528 0.8662639 0.8824743 0.22454327
    #> 314: 2020-11-09 4.464356 0.8773634 0.8845554 0.19480862
    #> 315: 2020-11-10 4.532254 0.8230178 0.7147899 0.38073588
    #> 316: 2020-11-11 4.415192 0.7462676 0.8225977 0.32353235
    #> 317: 2020-11-12 3.666675 0.3957351 0.6607121 0.19661800
    

    This solution takes more time than the function you already use because it has 50 times more calculations than the chunck version. It also wasn't possible to use data.table::frollapply which AFAIK can only output a 1 dimensional vector.