Search code examples
rlinear-regressionnonlinear-optimization

How to use optimization technique in r to transform variables for linear regression?


I am trying to transform variables for linear Regression that can give me the best Rsquare for the model (optimize the cost function - reduce error).

This is very easy to do using Excel solver but I don't know how to do it in R. Excel solver & this whole task is demonstrated here which I am trying to replicate in r: Same done using Excel solver

The variable need to be transformed like: var_transformed = var + lag(var_transfrm)*beta)^alpha which I have done below in step 2 & 3.

the best values that need to be figured out by optimizer is alpha & beta constants for vars transformation that will help in generating best lm model / minimize error.

1. creating df

library(tidyverse)

df_test <- tribble(
 ~sales, ~var1, ~var2,
  22, 230.1,  37.8,
  10, 44.5,  39.3,
  9,  17.2,  45.9,
  19, 151.5,  41.3,
  13, 180.8,  10.8,
  7,  8.7,    48.9,
  12, 57.5,   32.8,
  13, 120.2,  19.6,
  5,  8.6,    2.1,
  11, 199.8,  2.6)

2. constants that need to be solved for appropriate value

alpha1 = .1
beta1 = .1

alpha2 = .1
beta2 = .1

3. transforming variables for lm using above constants

this concept is based on adstock & diminishing returns.

df_test  <- df_test %>% 
  mutate(var1_transfrm = var1^alpha1,
         var1_transfrm = if_else(row_number() == 1, 
                               var1^alpha1,
                               ( (var1 + lag(var1_transfrm)*beta1)^alpha1 )
                               ) ,
         
         var2_transfrm = var2^alpha2,
         var2_transfrm = if_else(row_number() == 1, 
                               var2^alpha2,
                               ( (var2 + lag(var2_transfrm)*beta2)^alpha2 )
                               ) 
         )

df_test
A tibble: 10 × 5
   sales  var1  var2 var1_transfrm var2_transfrm
   <dbl> <dbl> <dbl>         <dbl>         <dbl>
 1    22 230.   37.8          1.72          1.44
 2    10  44.5  39.3          1.46          1.44
 3     9  17.2  45.9          1.33          1.47
 4    19 152.   41.3          1.65          1.45
 5    13 181.   10.8          1.68          1.27
 6     7   8.7  48.9          1.24          1.48
 7    12  57.5  32.8          1.50          1.42
 8    13 120.   19.6          1.61          1.35
 9     5   8.6   2.1          1.24          1.08
10    11 200.    2.6          1.70          1.10

4. Creating Model lm using transformed variables

first_model <- caret::train(form = sales ~ var1_transfrm + var2_transfrm, 
                            data = df_test,
                            method = "lm")

# for Rsquare value
first_model$results$Rsquared

1 0.8952383

I need to find the best Rsquare/ minimum RMSE,MAE value above.


Solution

  • As I am new to optimization techniques so it took me a while to research and post an answer but I am glad that I could find the answer to my problem and able to use r built in functions only rather than any other to solve this.

    test_excel_solver <- function(params, df_optim){
      
      alpha1 = params[1]
      alpha2 = params[2]
      beta1 = params[3]
      beta2 = params[4]
      
      # transforming variables using a,b parameters that need to figured out
      df_optim_tranfm <- df_optim %>% 
      mutate(var1_transfrm = var1^alpha1,
             var1_transfrm = if_else(row_number() == 1, 
                                   var1^alpha1,
                                   ( (var1 + lag(var1_transfrm)*beta1)^alpha1 )
                                   ) ,
             
             var2_transfrm = var2^alpha2,
             var2_transfrm = if_else(row_number() == 1, 
                                   var2^alpha2,
                                   ( (var2 + lag(var2_transfrm)*beta2)^alpha2 )
                                   ) 
             )
             
      # setting seed for lm
      set.seed(2022)
      
      
      # creating model
      model_optim <- lm(sales ~ var1_transfrm+var2_transfrm, data = df_optim_tranfm)
      
      # predicted values
      y_hat = predict(model_optim, df_optim_tranfm)
    
      # MAE that we need to minimize using optim
      MAE <- mean(abs(df_optim_tranfm$sales - y_hat))
      MAE
      
    }
    
    # Using optim function 
    optim(par = c(.1,.1,.1,.1), # initializing alpha1=.1,alpha2=.1 & so on till beta2=.1
          df_optim = df_test,
          fn = test_excel_solver, 
          lower = c(.00001,.00001,.00001,.00001),
          upper = c(1,1,1,1),
          method = "L-BFGS-B")
    

    Output

    $par
    [1] 0.9116201 0.9687086 0.0000100 1.0000000
    
    $value
    [1] 1.378359
    
    $counts
    function gradient 
          98       98 
    
    $convergence
    [1] 52
    
    $message
    [1] "ERROR: ABNORMAL_TERMINATION_IN_LNSRCH"
    

    P.S:

    1. I am not sure about the method used & differences in other options provided to optimize.
    2. This took me just 1-2 seconds to run.
    3. I also ran without optimization/optim() in a for loop which took me more than 1 hour when I used 14 cores, so optimization technique helped me alot in time saving & computation.