I'm attempting to use LASSO for a slightly different function than originally designed. I have a 22 different tasks on a test, which, when averaged, produce a final score. I want to see which combination of a limited number of tasks would best predict the overall score with the hopes of creating a short form of this test.
I'm using glmnet to run the lasso next, and it runs as expected. I can then easily find the model at a given lamda with
coef(cvfit, s = s)
However, I am wondering if it would be possible to specify the n of predictors that have non-zero coefficients, rather than the penalization parameter?
I've set up a very inefficient way of doing this as shown below by extracting the models from a grid of test lambdas, but I was wondering if there is a more efficient way of doing this
nvar <- list()
coeffs <- list()
for(j in 1:20000) {
s <- j / 20000
coeffs[j] <- coef(cvfit, s = s) ##Get coefficient list at given lamda
nvar[j] <- sum(as.vector(coef(cvfit, s = s)) != 0) - 1 ##Count number of variables with non-zero coeff and subtract one because intercept is always non-zero
}
nvar <- unlist(nvar)
getlamda <- function(numvar = 4) {
min.lambda <- min(lambdas[nvar == numvar]) / 20000 ##Find the smallest lambda which resulted in the given number of non-zero coefficients
coeffs[min.lambda]
}
After playing with Blended's solution above, I realized that there is an even simpler way of doing this.
Using the Boston dataset used in the example:
library(dplyr)
library(glmnet)
(boston <- MASS::Boston %>% tbl_df())
tr_x <- model.matrix(medv ~ ., data = boston)[,-1]
tr_y <- boston$medv
cvfit <- glmnet(tr_x, tr_y)
The cvfit
object already has all the components we need to find the answer for a given number of variables. df
is the number of degrees of freedom, and is the number of variable parameter in which we are interested. lambda
is the lambda for each model.
So we can create a simple function that returns the best model for a given number of variables.
get_nparam <- function(mod, numvar) {
coef(mod, s = with(cvfit, min(lambda[df == numvar])))
}
get_nparam(cvfit, 4)
#14 x 1 sparse Matrix of class "dgCMatrix"
# 1
#(Intercept) 15.468034114
#crim .
#zn .
#indus .
#chas .
#nox .
#rm 3.816165372
#age .
#dis .
#rad .
#tax .
#ptratio -0.606026131
#black 0.001518042
#lstat -0.495954410
#
Thanks again to Blender for a different solution and putting me on the path to this.