Search code examples
rloopsdata.tablelinear-regressionprediction

Iteration for simple linear regression model


I would like to fit/train a predictive model and do it intertatively, that is, for example, I train my model every 50 days in the selected period (here: all of 2020). Basically I want to predict the second column (DE) and use the remaining columns and parameters for the prediction. My data table can look like this:

set.seed(123)
days <- 50
## Create random data table: ##
dt.data <- data.table(date = seq(as.Date('2020-01-01'), by = '1 day', length.out = 366),
                      "DE" = rnorm(366, 35, 1), "Wind" = rnorm(366, 5000, 2), "Solar" = rnorm(366, 3, 2),
                      "Nuclear" = rnorm(366, 100, 5), "ResLoad" = rnorm(366, 200, 3),  check.names = FALSE)

The date range of my data table, the number of predictors and the number of days (days) can always differ. I have already done the model fitting/training for the whole data table once, but I do not know how to do this iteratively every 50 days? Here you can see a code snippet of my model fitting for a linear model:

v.trainDate <- dt.data$date
## Delete column "date" of train data for model fitting: ##
dt.data <- dt.data[, c("date") := NULL]

## MODEL FITTING: ##
## Linear Model: ##
lmModel <- stats::lm(DE ~ .-1, data = dt.data)

## Train PREDICTION with lmModel: ##
dt.data$prediction <- stats::predict.glm(lmModel, dt.data)
## Add date columns to dt.train: ##
dt.data <- data.table(date = v.trainDate, dt.data)

What I want to have at the end is that I train the model with my data first from 2020-01-01 to 2020-02-20 (first 50 days) and predict the DE price with this fitted model lmModel for the first fifty entries of my data table. Next run should be to train my model from 2020-02-20 to 2020-04-10 (next 50 days) and predict the values for this new 50 days. This should be done until the last December day for 2020. At the end I need a column, calledprediction as you can see in my code snippet, but this column should consist the interatively constructed predictions of the DE price.

I would also like to save the Variable Importance somewhere after each iteration? So that I can see which variable had the most influence on the DE price in the first 50 days, etc. Does anyone know how this could work?


Solution

  • The calclm function below calculates prediction and the calcImportance function calculates variable importance.

    by=seq_len(nrow(dt.data)) %/% days argument using data.table splits the dataset in days chuncks and applies the previous functions to each chunck :

    library(data.table)
    library(caret)
    
    ## Create random data table: ##
    dt.data <- data.table(date = seq(as.Date('2020-01-01'), by = '1 day', length.out = 366),
                          "DE" = rnorm(366, 35, 1), "Wind" = rnorm(366, 5000, 2), "Solar" = rnorm(366, 3, 2),
                          "Nuclear" = rnorm(366, 100, 5), "ResLoad" = rnorm(366, 200, 3),  check.names = FALSE)
    
    set.seed(123)
    days <- 50
    
    # Prediction calculation
    calcPred <- function(data) {
      lmModel <- stats::lm(DE ~ .-1-date, data = data)
      stats::predict.glm(lmModel, data)
    }
    
    # Importance calculation
    calcImportance <- function(data) {
      lmModel <- stats::lm(DE ~ .-1-date, data = data)
      terms <- attr(lmModel$terms , "term.labels")
      varimp <- caret::varImp(lmModel)
      importance <- data[,.(date,imp = t(varimp))]
    }
    
    importance.data <- data.table::copy(dt.data)
    importance.data[,calcImportance(.SD),by=seq_len(nrow(dt.data)) %/% days]
    #>      seq_len       date imp.Wind imp.Solar imp.Nuclear imp.ResLoad
    #>   1:       0 2020-01-01 4.598201 2.4726894   0.7993097   1.7153244
    #>   2:       0 2020-01-02 4.598201 2.4726894   0.7993097   1.7153244
    #>   3:       0 2020-01-03 4.598201 2.4726894   0.7993097   1.7153244
    #>   4:       0 2020-01-04 4.598201 2.4726894   0.7993097   1.7153244
    #>   5:       0 2020-01-05 4.598201 2.4726894   0.7993097   1.7153244
    #>  ---                                                              
    #> 362:       7 2020-12-27 1.093177 0.2558265   0.1610440   0.5383146
    #> 363:       7 2020-12-28 1.093177 0.2558265   0.1610440   0.5383146
    #> 364:       7 2020-12-29 1.093177 0.2558265   0.1610440   0.5383146
    #> 365:       7 2020-12-30 1.093177 0.2558265   0.1610440   0.5383146
    #> 366:       7 2020-12-31 1.093177 0.2558265   0.1610440   0.5383146
    
    dt.data[,c('prediction'):=calcPred(.SD),by=seq_len(nrow(dt.data)) %/% days]
    
    dt.data
    #>            date       DE     Wind      Solar   Nuclear  ResLoad prediction
    #>   1: 2020-01-01 36.51972 5000.608  1.4283653  92.19844 200.1163   35.02625
    #>   2: 2020-01-02 34.96544 4999.235  0.3860045  92.29005 203.2613   34.96200
    #>   3: 2020-01-03 35.16448 5002.232 -1.4450524 100.32920 202.5225   35.67255
    #>   4: 2020-01-04 36.07978 5000.564 -0.9137483  98.07459 206.3788   35.13325
    #>   5: 2020-01-05 35.10967 4997.606  4.9788029 101.27625 201.8148   34.29788
    #>  ---                                                                      
    #> 362: 2020-12-27 34.98190 4997.936  2.6117305  98.33027 195.1352   34.80871
    #> 363: 2020-12-28 35.16974 4998.799  1.0776123 108.26064 195.2474   35.01232
    #> 364: 2020-12-29 35.37956 4998.651  3.9252237 102.87948 201.0266   35.09362
    #> 365: 2020-12-30 35.51517 4999.428  5.9747031  92.38721 196.4204   34.60962
    #> 366: 2020-12-31 33.53278 5001.911  3.5062344  93.60744 197.5292   34.85689