Search code examples
rmodellinear-regressionglmfeature-selection

How to run the final model based on forward selection in R?


I have performed a forward selection in R on a very large dataset. According to the summary, I should use 65 out of 75 variables in my model to predict y. Now, I want to run a simple OLS model based on the selected variables. How can I specify my OLS model in R with only the selected variables as proposed by forward selection? Of course, I could manually enter the names of those variables, but this would be quite tedious.

I want to run such a model but only with the selected variables: fws_model <- glm(y ~ X1 + X2 + ... , data = training_set)

This is how I performed forward selection in R:

library(leaps)
regfit.fwd <- regsubsets(y ~ ., data = training_set, method = "forward", intercept=TRUE, really.big=TRUE, nvmax = 10000)
regfwd.summary <- summary(regfit.fwd)

Solution

  • How about this:

    set.seed(1234)
    data(mtcars)
    library(leaps)
    X <- matrix(rnorm(1000000, 0, 1), ncol=100)
    colnames(X) <- paste0("x", 1:100)
    b <- c(runif(50, .25, 2), rep(0, 50))
    yhat <- X %*% b
    y <- yhat + rnorm(1000, 0, .5*sd(yhat))
    dat<- cbind(data.frame(y=y), as.data.frame(X))
    
    regfit.fwd <- regsubsets(y ~ ., data = dat, method = "forward", intercept=TRUE, nvmax = 10000)
    which.max(summary(regfit.fwd)$adjr2)
    #> [1] 64
    incl <- which(summary(regfit.fwd)$which[64,])
    ## the [-1] takes out the intercept term from the 
    ## formula, it will be added automatically.  Make sure 
    ## if you do this that the variables included 
    ## have (Intercept) in them. 
    form <- reformulate(names(incl)[-1], response="y")
    form
    #> y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 + 
    #>     x12 + x13 + x14 + x15 + x16 + x17 + x18 + x19 + x20 + x21 + 
    #>     x22 + x23 + x24 + x25 + x26 + x27 + x28 + x29 + x30 + x31 + 
    #>     x32 + x33 + x34 + x35 + x36 + x37 + x38 + x39 + x40 + x41 + 
    #>     x42 + x43 + x44 + x45 + x46 + x47 + x48 + x49 + x50 + x53 + 
    #>     x56 + x60 + x64 + x67 + x72 + x73 + x74 + x81 + x91 + x92 + 
    #>     x95 + x98 + x99
    mod <- lm(form, dat)
    

    Created on 2022-11-26 by the reprex package (v2.0.1)