Search code examples
rstatistics

How can I make all-possible-regressions in R also include exponents and logs of the variables?


I'm using the olsrr package for statistical analysis when trying to compare as many model specifications as possible using all-possible-regressions.

My code:

model <- lm(y ~ ., data = "mydata"
k <- ols_all_subset(model)

which gives me a table with R2, adjusted R2, AIC, SIC etc for each combination of variables as is (linear). For example, if the variables are x1, x2 and x3, it gives me a table with R2, AIC, SIC etc for the specifications with every possible linear-linear specifications: with x1 x2 and x3 as regressors, with x1 and x2, with x1 and x3, with x2 and x3, and each of just x1, just x2, and just x3.

However, I also want to get all possibles for squares and logs of the variables to look at every possible major specification, for instance x1^2, log(x1), log(x3) etc.

I know I could individually create a new column and generate each x1^2, log(x1) etc. individually, but given the size of my data, this is not practical.


Solution

  • Here is suggestion on how to automate the process:

    You don't provide any sample data, so I'm going to use mtcars:

    df <- mtcars[, c(1,5:7)];
    

    Here, the response is column 1, and I consider 3 predictor variables in columns 5-7.

    The goal is to automatically build all relevant terms and use those to construct a formula, which we then use in lm.

    1. Build all predictor terms:

      # All but the first column are predictors
      terms <- sapply(colnames(df)[-1], function(x)
          c(x, sprintf("I(%s^2)", x), sprintf("log(%s)", x)));
      terms;
      #     drat        wt        qsec
      #[1,] "drat"      "wt"      "qsec"
      #[2,] "I(drat^2)" "I(wt^2)" "I(qsec^2)"
      #[3,] "log(drat)" "log(wt)" "log(qsec)"
      
    2. Construct the formula expression as a string.

      exprs <- sprintf("%s ~ %s", colnames(df)[1], paste(terms, collapse = "+"));
      exprs;
      [1] "mpg ~ drat+I(drat^2)+log(drat)+wt+I(wt^2)+log(wt)+qsec+I(qsec^2)+log(qsec)"
      
    3. Run the linear model.

      model <- lm(as.formula(exprs), data = df);
      
    4. Re-fit model using all combinations of predictor variables.

      require(olsrr);
      k <- ols_all_subset(model);
      k;
      #    # A tibble: 511 x 6
      #   Index     N        Predictors `R-Square` `Adj. R-Square` `Mallow's Cp`
      #   <int> <int>             <chr>      <chr>           <chr>         <chr>
      # 1     1     1           log(wt)    0.81015         0.80382       9.73408
      # 2     2     1                wt    0.75283         0.74459      21.12525
      # 3     3     1           I(wt^2)    0.64232         0.63040      43.08976
      # 4     4     1         I(drat^2)    0.46694         0.44917      77.94664
      # 5     5     1              drat    0.46400         0.44613      78.53263
      # 6     6     1         log(drat)    0.45406         0.43587      80.50653
      # 7     7     1         log(qsec)    0.17774         0.15033     135.42760
      # 8     8     1              qsec    0.17530         0.14781     135.91242
      # 9     9     1         I(qsec^2)    0.17005         0.14239     136.95476
      #10    10     2 log(wt) log(qsec)    0.87935         0.87103      -2.02118
      ## ... with 501 more rows
      

    A few comments:

    1. This very quickly gets very computationally intensive.

    2. The y ~ . formula syntax is very concise, but I haven't found a way to include e.g. quadratic terms. On one hand, y ~ . + (.)^2 works for including all interaction terms; on the other hand, y ~ . + I(.^2) does not work for quadratic terms. That's why I think building terms manually is the way to go.