I'm using the olsrr
package for statistical analysis when trying to compare as many model specifications as possible using all-possible-regressions.
My code:
model <- lm(y ~ ., data = "mydata"
k <- ols_all_subset(model)
which gives me a table with R2, adjusted R2, AIC, SIC etc for each combination of variables as is (linear). For example, if the variables are x1, x2 and x3, it gives me a table with R2, AIC, SIC etc for the specifications with every possible linear-linear specifications: with x1 x2 and x3 as regressors, with x1 and x2, with x1 and x3, with x2 and x3, and each of just x1, just x2, and just x3.
However, I also want to get all possibles for squares and logs of the variables to look at every possible major specification, for instance x1^2, log(x1), log(x3) etc.
I know I could individually create a new column and generate each x1^2, log(x1) etc. individually, but given the size of my data, this is not practical.
Here is suggestion on how to automate the process:
You don't provide any sample data, so I'm going to use mtcars
:
df <- mtcars[, c(1,5:7)];
Here, the response is column 1
, and I consider 3 predictor variables in columns 5-7
.
The goal is to automatically build all relevant terms and use those to construct a formula
, which we then use in lm
.
Build all predictor terms:
# All but the first column are predictors
terms <- sapply(colnames(df)[-1], function(x)
c(x, sprintf("I(%s^2)", x), sprintf("log(%s)", x)));
terms;
# drat wt qsec
#[1,] "drat" "wt" "qsec"
#[2,] "I(drat^2)" "I(wt^2)" "I(qsec^2)"
#[3,] "log(drat)" "log(wt)" "log(qsec)"
Construct the formula expression as a string.
exprs <- sprintf("%s ~ %s", colnames(df)[1], paste(terms, collapse = "+"));
exprs;
[1] "mpg ~ drat+I(drat^2)+log(drat)+wt+I(wt^2)+log(wt)+qsec+I(qsec^2)+log(qsec)"
Run the linear model.
model <- lm(as.formula(exprs), data = df);
Re-fit model using all combinations of predictor variables.
require(olsrr);
k <- ols_all_subset(model);
k;
# # A tibble: 511 x 6
# Index N Predictors `R-Square` `Adj. R-Square` `Mallow's Cp`
# <int> <int> <chr> <chr> <chr> <chr>
# 1 1 1 log(wt) 0.81015 0.80382 9.73408
# 2 2 1 wt 0.75283 0.74459 21.12525
# 3 3 1 I(wt^2) 0.64232 0.63040 43.08976
# 4 4 1 I(drat^2) 0.46694 0.44917 77.94664
# 5 5 1 drat 0.46400 0.44613 78.53263
# 6 6 1 log(drat) 0.45406 0.43587 80.50653
# 7 7 1 log(qsec) 0.17774 0.15033 135.42760
# 8 8 1 qsec 0.17530 0.14781 135.91242
# 9 9 1 I(qsec^2) 0.17005 0.14239 136.95476
#10 10 2 log(wt) log(qsec) 0.87935 0.87103 -2.02118
## ... with 501 more rows
A few comments:
This very quickly gets very computationally intensive.
The y ~ .
formula syntax is very concise, but I haven't found a way to include e.g. quadratic terms. On one hand, y ~ . + (.)^2
works for including all interaction terms; on the other hand, y ~ . + I(.^2)
does not work for quadratic terms. That's why I think building terms manually is the way to go.