I have performed a forward selection in R on a very large dataset. According to the summary, I should use 65 out of 75 variables in my model to predict y. Now, I want to run a simple OLS model based on the selected variables. How can I specify my OLS model in R with only the selected variables as proposed by forward selection? Of course, I could manually enter the names of those variables, but this would be quite tedious.
I want to run such a model but only with the selected variables:
fws_model <- glm(y ~ X1 + X2 + ... , data = training_set)
This is how I performed forward selection in R:
library(leaps)
regfit.fwd <- regsubsets(y ~ ., data = training_set, method = "forward", intercept=TRUE, really.big=TRUE, nvmax = 10000)
regfwd.summary <- summary(regfit.fwd)
How about this:
set.seed(1234)
data(mtcars)
library(leaps)
X <- matrix(rnorm(1000000, 0, 1), ncol=100)
colnames(X) <- paste0("x", 1:100)
b <- c(runif(50, .25, 2), rep(0, 50))
yhat <- X %*% b
y <- yhat + rnorm(1000, 0, .5*sd(yhat))
dat<- cbind(data.frame(y=y), as.data.frame(X))
regfit.fwd <- regsubsets(y ~ ., data = dat, method = "forward", intercept=TRUE, nvmax = 10000)
which.max(summary(regfit.fwd)$adjr2)
#> [1] 64
incl <- which(summary(regfit.fwd)$which[64,])
## the [-1] takes out the intercept term from the
## formula, it will be added automatically. Make sure
## if you do this that the variables included
## have (Intercept) in them.
form <- reformulate(names(incl)[-1], response="y")
form
#> y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 +
#> x12 + x13 + x14 + x15 + x16 + x17 + x18 + x19 + x20 + x21 +
#> x22 + x23 + x24 + x25 + x26 + x27 + x28 + x29 + x30 + x31 +
#> x32 + x33 + x34 + x35 + x36 + x37 + x38 + x39 + x40 + x41 +
#> x42 + x43 + x44 + x45 + x46 + x47 + x48 + x49 + x50 + x53 +
#> x56 + x60 + x64 + x67 + x72 + x73 + x74 + x81 + x91 + x92 +
#> x95 + x98 + x99
mod <- lm(form, dat)
Created on 2022-11-26 by the reprex package (v2.0.1)