I would like to fit/train a predictive model and do it intertatively, that is, for example, I train my model every 50
days in the selected period (here: all of 2020). Basically I want to predict the second column (DE
) and use the remaining columns and parameters for the prediction. My data table can look like this:
set.seed(123)
days <- 50
## Create random data table: ##
dt.data <- data.table(date = seq(as.Date('2020-01-01'), by = '1 day', length.out = 366),
"DE" = rnorm(366, 35, 1), "Wind" = rnorm(366, 5000, 2), "Solar" = rnorm(366, 3, 2),
"Nuclear" = rnorm(366, 100, 5), "ResLoad" = rnorm(366, 200, 3), check.names = FALSE)
The date range of my data table, the number of predictors and the number of days (days
) can always differ. I have already done the model fitting/training for the whole data table once, but I do not know how to do this iteratively every 50
days? Here you can see a code snippet of my model fitting for a linear model:
v.trainDate <- dt.data$date
## Delete column "date" of train data for model fitting: ##
dt.data <- dt.data[, c("date") := NULL]
## MODEL FITTING: ##
## Linear Model: ##
lmModel <- stats::lm(DE ~ .-1, data = dt.data)
## Train PREDICTION with lmModel: ##
dt.data$prediction <- stats::predict.glm(lmModel, dt.data)
## Add date columns to dt.train: ##
dt.data <- data.table(date = v.trainDate, dt.data)
What I want to have at the end is that I train the model with my data first from 2020-01-01
to 2020-02-20
(first 50 days) and predict the DE
price with this fitted model lmModel
for the first fifty entries of my data table. Next run should be to train my model from 2020-02-20
to 2020-04-10
(next 50 days) and predict the values for this new 50 days. This should be done until the last December day for 2020. At the end I need a column, calledprediction
as you can see in my code snippet, but this column should consist the interatively constructed predictions of the DE
price.
I would also like to save the Variable Importance somewhere after each iteration? So that I can see which variable had the most influence on the DE
price in the first 50 days, etc. Does anyone know how this could work?
The calclm
function below calculates prediction and the calcImportance
function calculates variable importance.
by=seq_len(nrow(dt.data)) %/% days
argument using data.table
splits the dataset in days
chuncks and applies the previous functions to each chunck :
library(data.table)
library(caret)
## Create random data table: ##
dt.data <- data.table(date = seq(as.Date('2020-01-01'), by = '1 day', length.out = 366),
"DE" = rnorm(366, 35, 1), "Wind" = rnorm(366, 5000, 2), "Solar" = rnorm(366, 3, 2),
"Nuclear" = rnorm(366, 100, 5), "ResLoad" = rnorm(366, 200, 3), check.names = FALSE)
set.seed(123)
days <- 50
# Prediction calculation
calcPred <- function(data) {
lmModel <- stats::lm(DE ~ .-1-date, data = data)
stats::predict.glm(lmModel, data)
}
# Importance calculation
calcImportance <- function(data) {
lmModel <- stats::lm(DE ~ .-1-date, data = data)
terms <- attr(lmModel$terms , "term.labels")
varimp <- caret::varImp(lmModel)
importance <- data[,.(date,imp = t(varimp))]
}
importance.data <- data.table::copy(dt.data)
importance.data[,calcImportance(.SD),by=seq_len(nrow(dt.data)) %/% days]
#> seq_len date imp.Wind imp.Solar imp.Nuclear imp.ResLoad
#> 1: 0 2020-01-01 4.598201 2.4726894 0.7993097 1.7153244
#> 2: 0 2020-01-02 4.598201 2.4726894 0.7993097 1.7153244
#> 3: 0 2020-01-03 4.598201 2.4726894 0.7993097 1.7153244
#> 4: 0 2020-01-04 4.598201 2.4726894 0.7993097 1.7153244
#> 5: 0 2020-01-05 4.598201 2.4726894 0.7993097 1.7153244
#> ---
#> 362: 7 2020-12-27 1.093177 0.2558265 0.1610440 0.5383146
#> 363: 7 2020-12-28 1.093177 0.2558265 0.1610440 0.5383146
#> 364: 7 2020-12-29 1.093177 0.2558265 0.1610440 0.5383146
#> 365: 7 2020-12-30 1.093177 0.2558265 0.1610440 0.5383146
#> 366: 7 2020-12-31 1.093177 0.2558265 0.1610440 0.5383146
dt.data[,c('prediction'):=calcPred(.SD),by=seq_len(nrow(dt.data)) %/% days]
dt.data
#> date DE Wind Solar Nuclear ResLoad prediction
#> 1: 2020-01-01 36.51972 5000.608 1.4283653 92.19844 200.1163 35.02625
#> 2: 2020-01-02 34.96544 4999.235 0.3860045 92.29005 203.2613 34.96200
#> 3: 2020-01-03 35.16448 5002.232 -1.4450524 100.32920 202.5225 35.67255
#> 4: 2020-01-04 36.07978 5000.564 -0.9137483 98.07459 206.3788 35.13325
#> 5: 2020-01-05 35.10967 4997.606 4.9788029 101.27625 201.8148 34.29788
#> ---
#> 362: 2020-12-27 34.98190 4997.936 2.6117305 98.33027 195.1352 34.80871
#> 363: 2020-12-28 35.16974 4998.799 1.0776123 108.26064 195.2474 35.01232
#> 364: 2020-12-29 35.37956 4998.651 3.9252237 102.87948 201.0266 35.09362
#> 365: 2020-12-30 35.51517 4999.428 5.9747031 92.38721 196.4204 34.60962
#> 366: 2020-12-31 33.53278 5001.911 3.5062344 93.60744 197.5292 34.85689