r model prediction glmnet lasso-regression

Using glmnet to predict a continuous variable in a dataset

I have this data set: wbh.

I wanted to use the R package glmnet to determine which predictors would be useful in predicting fertility. However, I have been unable to do so, most likely due to not having a full understanding of the package. The fertility variable is SP.DYN.TFRT.IN. I want to see which predictors in the data set give the most predictive power for fertility. I wanted to use LASSO or ridge regression to shrink the number of coefficients, and I know this package can do that. I'm just having some trouble implementing it.

I know there are no code snippets, for which I apologize, but I am rather lost on how to code this.

Any advice is appreciated.

Thank you for reading

Solution

Here is an example of how to run glmnet:

library(glmnet)
library(tidyverse)

df is the data set you provided.

select y variable:

y <- df$SP.DYN.TFRT.IN

select numerical variables:

df %>%
  select(-SP.DYN.TFRT.IN, -region, -country.code) %>%
  as.matrix() -> x

select factor variables and convert to dummy variables:

df %>%
  select(region, country.code) %>%
  model.matrix( ~ .-1, .) -> x_train
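To make the dummy-variable step concrete, here is a minimal sketch on a made-up data frame. The column `income` and all of its values are hypothetical and are not part of the data set in the question; the point is only to show what `model.matrix(~ . - 1, ...)` returns.

```r
# Toy data frame standing in for the factor columns (hypothetical values)
toy <- data.frame(
  region = factor(c("Asia", "Europe", "Asia", "Africa")),
  income = factor(c("low", "high", "high", "low"))
)

# "- 1" drops the intercept, so the first factor keeps one column per level;
# later factors still lose their reference level
dummies <- model.matrix(~ . - 1, data = toy)

colnames(dummies)  # one 0/1 indicator column per retained level
dim(dummies)       # rows match the input, columns are the indicators
```

The result is a plain numeric matrix, which is what glmnet expects and why it can be combined with the numeric predictors using cbind().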

Run the model(s). Several parameters here can be tweaked; I suggest checking the documentation. Here I just run 5-fold cross-validation to determine the best lambda:

cv_fit <- cv.glmnet(x, y, nfolds = 5) #just with numeric variables

cv_fit_2 <- cv.glmnet(cbind(x, x_train), y, nfolds = 5) #both factor and numeric variables

par(mfrow = c(2,1))
plot(cv_fit)
plot(cv_fit_2)
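The calls above leave alpha at its default of 1, which is the LASSO penalty. Since the question mentions both LASSO and ridge, here is a small sketch on simulated data (the names x_sim and y_sim and the simulated values are assumptions for illustration, not the question's data) showing how the alpha argument switches between the two and how that affects the number of non-zero coefficients.

```r
library(glmnet)

set.seed(1)
n <- 100
p <- 20
x_sim <- matrix(rnorm(n * p), n, p)
# Only the first two predictors actually drive the response
y_sim <- 2 * x_sim[, 1] - 1.5 * x_sim[, 2] + rnorm(n)

lasso_fit <- cv.glmnet(x_sim, y_sim, alpha = 1, nfolds = 5)  # LASSO (L1)
ridge_fit <- cv.glmnet(x_sim, y_sim, alpha = 0, nfolds = 5)  # ridge (L2)

# Count non-zero coefficients, excluding the intercept in row 1
n_lasso <- sum(coef(lasso_fit, s = "lambda.min")[-1, 1] != 0)
n_ridge <- sum(coef(ridge_fit, s = "lambda.min")[-1, 1] != 0)

n_lasso  # typically far fewer than p
n_ridge  # ridge shrinks coefficients but does not set them to zero
```

If the goal is to find out which predictors matter, LASSO (or an elastic net with alpha between 0 and 1) is the more natural choice, because ridge keeps every predictor in the model.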

best lambda:

cv_fit$lambda[which.min(cv_fit$cvm)]

coefficients at best lambda

coef(cv_fit, s = cv_fit$lambda[which.min(cv_fit$cvm)])

equivalent to:

coef(cv_fit, s = "lambda.min")
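As a quick check of that equivalence, the sketch below (simulated data, hypothetical names) compares the hand-computed lambda with the lambda.min element stored in the cv.glmnet object, and also looks at lambda.1se, the largest lambda whose cross-validated error is within one standard error of the minimum.

```r
library(glmnet)

set.seed(7)
x_sim <- matrix(rnorm(150 * 8), 150, 8)
y_sim <- x_sim[, 1] + rnorm(150)

cv_sim <- cv.glmnet(x_sim, y_sim, nfolds = 5)

# Lambda with the lowest mean cross-validated error, computed by hand
manual_min <- cv_sim$lambda[which.min(cv_sim$cvm)]

# Same value as the one stored by cv.glmnet
same_lambda <- isTRUE(all.equal(manual_min, cv_sim$lambda.min))

# lambda.1se is a more heavily regularized alternative
more_regularized <- cv_sim$lambda.1se >= cv_sim$lambda.min
```

Passing s = "lambda.1se" to coef() usually gives a sparser model than s = "lambda.min", at the cost of slightly higher estimated error.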

After running coef(cv_fit, s = "lambda.min"), all features shown with a . in the resulting table have a coefficient of exactly zero and are dropped from the model. This corresponds to the lambda marked by the left vertical dashed line on the plots.
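To list only the predictors that survive, one option is to convert the sparse coefficient table to a dense matrix and filter it. This is a sketch on simulated data; the names var1 to var10 are made up, and with the real data the same steps would be applied to cv_fit.

```r
library(glmnet)

set.seed(42)
x_sim <- matrix(rnorm(200 * 10), 200, 10,
                dimnames = list(NULL, paste0("var", 1:10)))
# Only var1 and var4 drive the simulated response
y_sim <- 3 * x_sim[, "var1"] - 2 * x_sim[, "var4"] + rnorm(200)

cv_sim <- cv.glmnet(x_sim, y_sim, nfolds = 5)

# Dense (p + 1) x 1 matrix; taking column 1 gives a named numeric vector
coefs <- as.matrix(coef(cv_sim, s = "lambda.min"))[, 1]

# Keep non-zero coefficients and drop the intercept
kept <- coefs[coefs != 0 & names(coefs) != "(Intercept)"]

# Retained predictors, largest absolute coefficient first
kept[order(abs(kept), decreasing = TRUE)]
```

Note that glmnet reports coefficients on the original scale of each predictor, so comparing magnitudes is only meaningful when the predictors have comparable units or have been scaled beforehand.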
I suggest reading the linked documentation: elastic nets are quite easy to grasp if you know a bit of linear regression, and the package is quite intuitive. I also suggest reading ISLR, at least the part on L1 / L2 regularization, and watching these videos: 1, 2, 3, 4, 5, 6. The first three are about estimating model performance via test error, and the last three are about the question at hand. This one shows how to implement these models in R. By the way, the presenters in these videos invented LASSO and wrote glmnet.

Also check the glmnetUtils library, which provides a formula interface and other nice things such as built-in selection of the mixing parameter (alpha). Here is the vignette.
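A minimal sketch of the formula interface follows, using the built-in mtcars data as a stand-in because the question's data set is not reproduced here. It assumes glmnetUtils is installed; the calls reflect my understanding of the package, so check the vignette for the exact arguments.

```r
library(glmnetUtils)

set.seed(3)

# Formula interface: factors in the data frame are expanded automatically,
# so there is no need to build the model matrix by hand
cv_formula <- cv.glmnet(mpg ~ ., data = mtcars, nfolds = 5)
coef(cv_formula, s = "lambda.min")

# Cross-validate over a grid of alpha values as well as lambda
cva <- cva.glmnet(mpg ~ ., data = mtcars, nfolds = 5)
```

With the data from the question, the formula would presumably be SP.DYN.TFRT.IN ~ . with data = df.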