I am trying to transform variables for linear Regression that can give me the best Rsquare for the model (optimize the cost function - reduce error).
This is very easy to do using Excel solver but I don't know how to do it in R. Excel solver & this whole task is demonstrated here which I am trying to replicate in r: Same done using Excel solver
The variable need to be transformed like: var_transformed = var + lag(var_transfrm)*beta)^alpha
which I have done below in step 2 & 3.
the best values that need to be figured out by optimizer is alpha
& beta
constants for vars transformation that will help in generating best lm model / minimize error.
1. creating df
library(tidyverse)
df_test <- tribble(
~sales, ~var1, ~var2,
22, 230.1, 37.8,
10, 44.5, 39.3,
9, 17.2, 45.9,
19, 151.5, 41.3,
13, 180.8, 10.8,
7, 8.7, 48.9,
12, 57.5, 32.8,
13, 120.2, 19.6,
5, 8.6, 2.1,
11, 199.8, 2.6)
2. constants that need to be solved for appropriate value
alpha1 = .1
beta1 = .1
alpha2 = .1
beta2 = .1
3. transforming variables for lm using above constants
this concept is based on adstock
& diminishing returns
.
df_test <- df_test %>%
mutate(var1_transfrm = var1^alpha1,
var1_transfrm = if_else(row_number() == 1,
var1^alpha1,
( (var1 + lag(var1_transfrm)*beta1)^alpha1 )
) ,
var2_transfrm = var2^alpha2,
var2_transfrm = if_else(row_number() == 1,
var2^alpha2,
( (var2 + lag(var2_transfrm)*beta2)^alpha2 )
)
)
df_test
A tibble: 10 × 5
sales var1 var2 var1_transfrm var2_transfrm
<dbl> <dbl> <dbl> <dbl> <dbl>
1 22 230. 37.8 1.72 1.44
2 10 44.5 39.3 1.46 1.44
3 9 17.2 45.9 1.33 1.47
4 19 152. 41.3 1.65 1.45
5 13 181. 10.8 1.68 1.27
6 7 8.7 48.9 1.24 1.48
7 12 57.5 32.8 1.50 1.42
8 13 120. 19.6 1.61 1.35
9 5 8.6 2.1 1.24 1.08
10 11 200. 2.6 1.70 1.10
4. Creating Model lm using transformed variables
first_model <- caret::train(form = sales ~ var1_transfrm + var2_transfrm,
data = df_test,
method = "lm")
# for Rsquare value
first_model$results$Rsquared
1 0.8952383
I need to find the best Rsquare/ minimum RMSE,MAE value above.
As I am new to optimization
techniques so it took me a while to research and post an answer but I am glad that I could find the answer to my problem and able to use r
built in functions only rather than any other to solve this.
test_excel_solver <- function(params, df_optim){
alpha1 = params[1]
alpha2 = params[2]
beta1 = params[3]
beta2 = params[4]
# transforming variables using a,b parameters that need to figured out
df_optim_tranfm <- df_optim %>%
mutate(var1_transfrm = var1^alpha1,
var1_transfrm = if_else(row_number() == 1,
var1^alpha1,
( (var1 + lag(var1_transfrm)*beta1)^alpha1 )
) ,
var2_transfrm = var2^alpha2,
var2_transfrm = if_else(row_number() == 1,
var2^alpha2,
( (var2 + lag(var2_transfrm)*beta2)^alpha2 )
)
)
# setting seed for lm
set.seed(2022)
# creating model
model_optim <- lm(sales ~ var1_transfrm+var2_transfrm, data = df_optim_tranfm)
# predicted values
y_hat = predict(model_optim, df_optim_tranfm)
# MAE that we need to minimize using optim
MAE <- mean(abs(df_optim_tranfm$sales - y_hat))
MAE
}
# Using optim function
optim(par = c(.1,.1,.1,.1), # initializing alpha1=.1,alpha2=.1 & so on till beta2=.1
df_optim = df_test,
fn = test_excel_solver,
lower = c(.00001,.00001,.00001,.00001),
upper = c(1,1,1,1),
method = "L-BFGS-B")
Output
$par
[1] 0.9116201 0.9687086 0.0000100 1.0000000
$value
[1] 1.378359
$counts
function gradient
98 98
$convergence
[1] 52
$message
[1] "ERROR: ABNORMAL_TERMINATION_IN_LNSRCH"
P.S: