Search code examples
rstatisticsglmnet

Using glmnet to find optimized model for a given number of predictors


I'm attempting to use LASSO for a slightly different function than originally designed. I have a 22 different tasks on a test, which, when averaged, produce a final score. I want to see which combination of a limited number of tasks would best predict the overall score with the hopes of creating a short form of this test.

I'm using glmnet to run the lasso next, and it runs as expected. I can then easily find the model at a given lamda with

coef(cvfit, s = s)

However, I am wondering if it would be possible to specify the n of predictors that have non-zero coefficients, rather than the penalization parameter?

I've set up a very inefficient way of doing this as shown below by extracting the models from a grid of test lambdas, but I was wondering if there is a more efficient way of doing this

nvar <- list()
coeffs <- list()

for(j in 1:20000) {

  s <- j / 20000

  coeffs[j] <- coef(cvfit, s = s) ##Get coefficient list at given lamda

  nvar[j] <- sum(as.vector(coef(cvfit, s = s)) != 0) - 1 ##Count number of variables with non-zero coeff and subtract one because intercept is always non-zero

}

nvar <- unlist(nvar)

getlamda <- function(numvar = 4) {

  min.lambda <- min(lambdas[nvar == numvar]) / 20000 ##Find the smallest lambda which resulted in the given number of non-zero coefficients

  coeffs[min.lambda]

}

Solution

  • After playing with Blended's solution above, I realized that there is an even simpler way of doing this.

    Using the Boston dataset used in the example:

    library(dplyr)
    library(glmnet)
    
    (boston <- MASS::Boston %>% tbl_df())
    
    tr_x <- model.matrix(medv ~ ., data = boston)[,-1]
    tr_y <- boston$medv
    cvfit <- glmnet(tr_x, tr_y)
    

    The cvfit object already has all the components we need to find the answer for a given number of variables. df is the number of degrees of freedom, and is the number of variable parameter in which we are interested. lambda is the lambda for each model.

    So we can create a simple function that returns the best model for a given number of variables.

    get_nparam <- function(mod, numvar) {
    
      coef(mod, s = with(cvfit, min(lambda[df == numvar])))
    
    }
    
    get_nparam(cvfit, 4)
    
    #14 x 1 sparse Matrix of class "dgCMatrix"
    #                       1
    #(Intercept) 15.468034114
    #crim         .          
    #zn           .          
    #indus        .          
    #chas         .          
    #nox          .          
    #rm           3.816165372
    #age          .          
    #dis          .          
    #rad          .          
    #tax          .          
    #ptratio     -0.606026131
    #black        0.001518042
    #lstat       -0.495954410
    #
    

    Thanks again to Blender for a different solution and putting me on the path to this.