Tags: r, sparse-matrix, dummy-data, glmnet, lasso-regression

Lasso, glmnet, preprocessing of the data


I'm trying to use the glmnet package to fit a lasso (L1 penalty) on a model with a binary outcome (a logit). My predictors are all binary (unordered 1/0 indicators, about 4,000 of them) except for one continuous variable. I need to convert the predictors into a sparse matrix, since fitting takes forever and a day otherwise. My question is: it seems that people use sparse.model.matrix rather than just converting their matrix into a sparse matrix. Why is that, and do I need to do this here? The results differ slightly between the two methods.

Also, do my binary variables need to be coded as factors (both the outcome and the predictors), or is it sufficient to use the sparse matrix and specify in the glmnet call that the outcome is binomial?

Here's what I'm doing so far:

library(Matrix)  # Matrix(), sparse.model.matrix()
library(glmnet)  # cv.glmnet()

# Create a random dataset: y is the outcome, x_d holds the dummies (10 here for simplicity), and x_c is the continuous variable
y <- sample(0:1, 200, replace = TRUE)
x_d <- matrix(sample(0:1, 2000, replace = TRUE), nrow = 200, ncol = 10)
x_c <- sample(60:90, 200, replace = TRUE)

# First: scale the one continuous variable.
scaled <- scale(x_c, center = TRUE, scale = TRUE)

# Then bind the predictors together.
x <- cbind(x_d, scaled)

# HERE'S MY MAIN QUESTION: what I currently do is:
xt <- Matrix(x, sparse = TRUE)

# Then run the cross-validation...
cv_lasso_1 <- cv.glmnet(xt, y, family = "binomial", standardize = FALSE)

# which gives slightly different results from this (here the outcome variable is in the same data frame as the predictors)
xt <- sparse.model.matrix(y ~ ., data = data.frame(y = y, x))

# Then run CV on this xt as well.

So to sum up, my two questions are: (1) Do I need to use sparse.model.matrix even if my factors are just binary and not ordered (and if yes, what does it actually do differently from just converting the matrix to a sparse matrix)? (2) Do I need to code the binary variables as factors? The reason I ask is that my dataset is huge, and it saves a lot of time to skip coding them as factors.


Solution

  • I don't think you need sparse.model.matrix. All it really gives you over a regular matrix is expansion of factor terms, and if your predictors are already binary, that won't give you anything. You certainly don't need to code them as factors; I frequently use glmnet on a regular (non-model) sparse matrix containing only 1's. At the end of the day glmnet is a numerical method, so a factor gets converted to numbers in the end regardless.
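
To make that concrete, here is a minimal sketch (not a definitive recipe) that compares the routes on toy data shaped like the question's. It only needs the Matrix package. The column names d1..d10 and the object names are made up for illustration, and the seed is arbitrary.

```r
library(Matrix)

set.seed(1)  # arbitrary seed, only so the toy data is reproducible
n <- 200

# Toy data: 10 binary dummies plus one scaled continuous column
x_d <- matrix(sample(0:1, n * 10, replace = TRUE), nrow = n, ncol = 10)
x_c <- as.numeric(scale(sample(60:90, n, replace = TRUE)))
x <- cbind(x_d, x_c)
colnames(x) <- c(paste0("d", 1:10), "x_c")  # hypothetical column names

# Route 1: convert the numeric matrix directly
direct <- Matrix(x, sparse = TRUE)

# Route 2: formula interface on numeric columns; "- 1" drops the intercept
via_formula <- sparse.model.matrix(~ . - 1, data = as.data.frame(x))

# With the default formula (as in y ~ .) an "(Intercept)" column is added
with_intercept <- sparse.model.matrix(~ ., data = as.data.frame(x))

# Route 3: code the dummies as factors first, then drop the intercept column
df_factor <- as.data.frame(x)
df_factor[1:10] <- lapply(df_factor[1:10], factor)
via_factor <- sparse.model.matrix(~ ., data = df_factor)[, -1]

# Compare the numeric content only (row and column names differ by route)
same_numeric <- isTRUE(all.equal(as.matrix(direct), as.matrix(via_formula),
                                 check.attributes = FALSE))
same_factor <- isTRUE(all.equal(as.matrix(direct), as.matrix(via_factor),
                                check.attributes = FALSE))

print(c(same_numeric = same_numeric, same_factor = same_factor))
print(colnames(with_intercept)[1])
```

On data like this, all three routes should carry the same numbers; the formula interface only adds the intercept column and different column names. If the cross-validated results still differ slightly between two inputs, one likely contributor is that cv.glmnet assigns folds at random, so passing the same foldid vector to both calls (or calling set.seed before each) should remove that source of variation.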