I would like to iteratively fit a linear model (= LM) and calculate the variable importance of the individual variables/predictors after each iteration with the caret::varImp()
function. My data table and the number of days look like this (the number of columns of dt.train
can always differ and the days
variable also):
d <- 50
## Create random data table: ##
dt.train <- data.table(date = seq(as.Date('2020-01-01'), by = '1 day', length.out = 366),
"DE" = rnorm(366, 35, 1), "Wind" = rnorm(366, 5000, 2), "Solar" = rnorm(366, 3, 2),
"Nuclear" = rnorm(366, 100, 5), "ResLoad" = rnorm(366, 200, 3), check.names = FALSE)
I also already have a function that calculates this for me every d
days (here d<-50
):
varImportance <- function(data){
## Model fitting: ##
lmModel <- stats::lm(DE ~ .-1-date, data = data)
terms <- attr(lmModel$terms , "term.labels")
varimp <- caret::varImp(lmModel)
importance <- data[, .(date, imp = t(varimp))]
}
## Get variable importance: ##
dt.importance <- dt.train[, varImportance(.SD), by = seq_len(nrow(dt.train)) %/% d]
Now I would like to have a loop that builds up the iteration for me as follows:
1. Run: Model fitting with the data from "2020-01-01" to "2020-02-20" (which are here exactly the 50 days)
2. Run: Model fitting with the data from "2020-01-02" to "2020-02-21"
3. Run: Model fitting with the data from "2020-01-03" to "2020-02-22"
... and so on
Last run: Model fitting with the data from "2020-11-11" to "2020-12-31"
The variable importance should then always be saved in a new table to match the model fitting, i.e.:
variable Importance of 1. Run: should be declared for "2020-01-01",
variable Importance of 2. Run: should be declared for "2020-01-02",
... and so on
variable Importance of Last Run: should be declared for "2020-11-11".
How could this work?
You could use rollapply
from the zoo
package.
Important arguments are :
width
to set the windowby.column = FALSE
to pass all the columns together to the modelaligned = 'left'
so that the roll window starts from the first data point onAs rollapply
works on matrices, it converts dates
mixed with numeric
to character
, see, so that the date
field had to be handled separately.
library(data.table)
library(caret)
library(zoo)
d <- 50
## Create random data table: ##
dt.train <- data.table(date = seq(as.Date('2020-01-01'), by = '1 day', length.out = 366),
"DE" = rnorm(366, 35, 1), "Wind" = rnorm(366, 5000, 2), "Solar" = rnorm(366, 3, 2),
"Nuclear" = rnorm(366, 100, 5), "ResLoad" = rnorm(366, 200, 3), check.names = FALSE)
varImportance <- function(data){
## Model fitting: ##
lmModel <- stats::lm(DE ~ .-1, data = data.table(data))
terms <- attr(lmModel$terms , "term.labels")
varimp <- caret::varImp(lmModel)
importance <- t(varimp)
}
# Removing date because rollapply needs a unique type
Importance <- as.data.frame(zoo::rollapply(dt.train[,!"date"],
FUN = varImportance,
width = d,
by.column=FALSE,
align='left')
)
# Adding back date
Importance <- cbind(dt.train[1:nrow(Importance),.(date)],Importance)
Importance
#> date Wind Solar Nuclear ResLoad
#> 1: 2020-01-01 2.523219 1.0253985 0.1676970 0.80379590
#> 2: 2020-01-02 2.535376 1.3231915 0.3292608 0.78803748
#> 3: 2020-01-03 2.636790 1.5249620 0.4857825 0.85169700
#> 4: 2020-01-04 3.158113 1.1318521 0.1869724 0.24190772
#> 5: 2020-01-05 3.326954 1.0991870 0.2341736 0.09327451
#> ---
#> 313: 2020-11-08 4.552528 0.8662639 0.8824743 0.22454327
#> 314: 2020-11-09 4.464356 0.8773634 0.8845554 0.19480862
#> 315: 2020-11-10 4.532254 0.8230178 0.7147899 0.38073588
#> 316: 2020-11-11 4.415192 0.7462676 0.8225977 0.32353235
#> 317: 2020-11-12 3.666675 0.3957351 0.6607121 0.19661800
This solution takes more time than the function you already use because it has 50 times more calculations than the chunck version. It also wasn't possible to use data.table::frollapply
which AFAIK can only output a 1 dimensional vector.