r model prediction glmnet lasso-regression

Using glmnet to predict a continuous variable in a dataset


I have this data set: wbh.

I wanted to use the R package glmnet to determine which predictors would be useful in predicting fertility. However, I have been unable to do so, most likely due to not having a full understanding of the package. The fertility variable is SP.DYN.TFRT.IN. I want to see which predictors in the data set give the most predictive power for fertility. I wanted to use LASSO or ridge regression to shrink the number of coefficients, and I know this package can do that. I'm just having some trouble implementing it.

I apologize for not including any code snippets; I am rather lost on how to code this.

Any advice is appreciated.

Thank you for reading


Solution

  • Here is an example of how to run glmnet:

    library(glmnet)
    library(tidyverse)
    

    df is the data set you provided.

    select y variable:

    y <- df$SP.DYN.TFRT.IN
    

    select numerical variables:

    df %>%
      select(-SP.DYN.TFRT.IN, -region, -country.code) %>%
      as.matrix() -> x
    

    select factor variables and convert to dummy variables:

    df %>%
      select(region, country.code) %>%
      model.matrix( ~ .-1, .) -> x_train
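
To make the dummy-variable step concrete, here is a minimal sketch on a hypothetical toy data frame (the `toy` object, its `gdp` column, and its values are invented for illustration and are not taken from the question's data). It shows how `model.matrix()` with `- 1` in the formula turns a factor into one indicator column per level, and how the result can be bound to a numeric column to give the kind of matrix glmnet expects.

```r
# Hypothetical toy data: only the column name "region" mimics the question's data set
toy <- data.frame(
  region = factor(c("Asia", "Europe", "Asia", "Africa")),
  gdp    = c(1.2, 3.4, 2.2, 0.8)
)

# "- 1" removes the intercept, so every level of the factor gets its own 0/1 column
dummies <- model.matrix(~ region - 1, data = toy)
print(colnames(dummies))

# combine the numeric column with the indicator columns into one numeric matrix
x_toy <- cbind(gdp = toy$gdp, dummies)
print(dim(x_toy))
```

Each row of `dummies` has exactly one 1, marking that row's region.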
    

    run the model(s). Several parameters here can be tweaked, so I suggest checking the documentation. Here I just run 5-fold cross-validation to determine the best lambda:

    cv_fit <- cv.glmnet(x, y, nfolds = 5) #just with numeric variables
    
    cv_fit_2 <- cv.glmnet(cbind(x, x_train), y, nfolds = 5) #both factor and numeric variables
    
    par(mfrow = c(2,1))
    plot(cv_fit)
    plot(cv_fit_2)
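
One of the parameters worth knowing about is `alpha`, which switches between LASSO (`alpha = 1`, the default) and ridge (`alpha = 0`). The sketch below compares the two on simulated data; `x_sim`, `y_sim`, and the chosen coefficients are assumptions made up for illustration, not the question's data. Passing the same `foldid` to both calls keeps the cross-validation splits identical, so the errors are comparable.

```r
library(glmnet)

# Simulated data (hypothetical): only the first two of ten predictors matter
set.seed(1)
n <- 100
p <- 10
x_sim <- matrix(rnorm(n * p), n, p)
colnames(x_sim) <- paste0("v", 1:p)
y_sim <- 2 * x_sim[, 1] - 1.5 * x_sim[, 2] + rnorm(n)

# assign each observation to one of 5 folds, reused for both fits
foldid <- sample(rep(1:5, length.out = n))

lasso_fit <- cv.glmnet(x_sim, y_sim, alpha = 1, foldid = foldid)  # L1 penalty
ridge_fit <- cv.glmnet(x_sim, y_sim, alpha = 0, foldid = foldid)  # L2 penalty

# smallest cross-validated mean squared error reached by each penalty
print(min(lasso_fit$cvm))
print(min(ridge_fit$cvm))
```

Ridge shrinks coefficients toward zero but generally does not set them exactly to zero, so for picking out a subset of predictors LASSO (or an intermediate `alpha`) is usually the more natural choice.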
    


    best lambda:

    cv_fit$lambda[which.min(cv_fit$cvm)]
    

    coefficients at best lambda:

    coef(cv_fit, s = cv_fit$lambda[which.min(cv_fit$cvm)])
    

    equivalent to:

    coef(cv_fit, s = "lambda.min")
    

    After running coef(cv_fit, s = "lambda.min"), every feature shown with a . in the resulting table has a coefficient of exactly zero and is therefore dropped from the model. This corresponds to the lambda marked by the left vertical dashed line on the plots.
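
Reading the retained predictors off a long sparse table gets tedious, so it can help to pull out only the non-zero coefficients. This is a minimal sketch on simulated data (`x_sim`, `y_sim`, and `cv_sim` are hypothetical stand-ins for `x`, `y`, and `cv_fit`). It uses `s = "lambda.1se"`, the more heavily penalized of the two lambdas that `cv.glmnet` stores; `"lambda.min"` works the same way.

```r
library(glmnet)

# Simulated data (hypothetical): v1 and v2 carry the signal, v3..v8 are noise
set.seed(42)
n <- 200
p <- 8
x_sim <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("v", 1:p)))
y_sim <- 3 * x_sim[, "v1"] - 2 * x_sim[, "v2"] + rnorm(n)

cv_sim <- cv.glmnet(x_sim, y_sim, nfolds = 5)

# coef() returns a sparse one-column matrix; turn it into a named numeric vector
b <- coef(cv_sim, s = "lambda.1se")
b <- setNames(as.numeric(as.matrix(b)), rownames(b))

# keep the predictors whose coefficient is not exactly zero, largest magnitude first
kept <- b[b != 0 & names(b) != "(Intercept)"]
print(kept[order(abs(kept), decreasing = TRUE)])
```

Coefficients are reported on the original scale of each predictor, so sorting by magnitude is only a rough ranking unless the predictors are on comparable scales.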
    I suggest reading the linked documentation. Elastic nets are quite easy to grasp if you know a bit of linear regression, and the package is quite intuitive. I also suggest reading ISLR, at least the part on L1 / L2 regularization, and watching these videos: 1, 2, 3, 4, 5, 6. The first three are about estimating model performance via test error and the last three are about the question at hand. This one shows how to implement these models in R. By the way, the people in the videos invented LASSO and wrote glmnet.
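
As a small illustration of estimating performance via test error, the sketch below holds out part of a simulated data set, tunes lambda on the remainder, and uses `predict()` on the held-out rows. The data and the 100/50 split are assumptions chosen for illustration only.

```r
library(glmnet)

# Simulated data (hypothetical): one informative predictor out of six
set.seed(7)
n <- 150
p <- 6
x_sim <- matrix(rnorm(n * p), n, p)
y_sim <- 1.5 * x_sim[, 1] + rnorm(n)

# hold out 50 observations; cross-validate lambda on the other 100 only
train <- sample(n, 100)
cv_sim <- cv.glmnet(x_sim[train, ], y_sim[train], nfolds = 5)

# predict the held-out rows at the lambda with the lowest cross-validated error
pred <- predict(cv_sim, newx = x_sim[-train, ], s = "lambda.min")

# mean squared error on data the model never saw
test_mse <- mean((y_sim[-train] - pred)^2)
print(test_mse)
```

Note that `newx` has to be a matrix with the same columns, in the same order, as the matrix used for fitting.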

    Also check the glmnetUtils library, which provides a formula interface and other nice things such as built-in selection of the mixing parameter (alpha). Here is the vignette.
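
A rough sketch of what that looks like, assuming glmnetUtils is installed and using a made-up data frame `toy` (not the question's data). With the formula interface the factor column is expanded into dummy variables automatically, and `cva.glmnet()` cross-validates over a grid of alpha values as well as lambda.

```r
library(glmnetUtils)

# Hypothetical toy data with two numeric predictors and one factor
set.seed(3)
toy <- data.frame(
  x1  = rnorm(100),
  x2  = rnorm(100),
  grp = factor(sample(c("a", "b", "c"), 100, replace = TRUE))
)
toy$y <- 2 * toy$x1 + (toy$grp == "b") + rnorm(100)

# formula interface: no manual model.matrix() step is needed
cv_f <- glmnetUtils::cv.glmnet(y ~ ., data = toy, nfolds = 5)
print(coef(cv_f, s = "lambda.min"))

# cross-validate over three alpha values (ridge, an even mix, LASSO)
cva <- glmnetUtils::cva.glmnet(y ~ ., data = toy, alpha = c(0, 0.5, 1), nfolds = 5)
print(cva$alpha)
```

The fitted `cva` object holds one `cv.glmnet` fit per alpha in `cva$modlist`, which can be compared by their cross-validated error.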