Dummies in Lasso Regression in R


I have a dataset of 690 observations with categorical and numerical variables. I want to perform Lasso regression, but when I plot the Lasso curve it is not smooth, and I would like to know whether the problem comes from the dummies or from something else. Here is a reproducible example dataset:

num1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
cat1 = c("a", "b", "a", "a", "b", "a", "b", "a", "a", "b")
cat2 = c("gg", "uu", "t", "t", "t", "uu", "uu", "gg", "t", "t") 
x = c(0, 0, 1, 1, 0, 0, 0, 1, 1, 0)
ex = data.frame(num1, cat1, cat2, x)

And here is the code:

library(fastDummies)
ex <- dummy_cols(ex, select_columns = c("cat1", "cat2"), remove_first_dummy = TRUE)


xxx <- ex[,1:3]
yyy <- ex$x
unique(yyy)

xxx <- data.matrix(xxx)

library(glmnet)
set.seed(999)
mod.lasso <- cv.glmnet(xxx, yyy, 
                         family='binomial', alpha=1, 
                         parallel=TRUE, standardize=TRUE, type.measure='auc')

Here you can see my plot:

[plot: cross-validated AUC curve from cv.glmnet, jagged rather than smooth]


Solution

  • If you look at the output:

    library(fastDummies)
    ex = data.frame(num1, cat1, cat2, x)
    ex <- dummy_cols(ex, select_columns = c("cat1", "cat2"), remove_first_dummy = TRUE)
    
    head(ex)
    
      num1 cat1 cat2 x cat1_b cat2_t cat2_uu
    1    1    a   gg 0      0      0       0
    2    2    b   uu 0      1      0       1
    3    3    a    t 1      0      1       0
    4    4    a    t 1      0      1       0
    5    5    b    t 0      1      1       0
    6    6    a   uu 0      0      0       1
    

    What you actually need are cat1_b, cat2_t and cat2_uu: your categorical columns converted to binary indicators. Taking the first three columns is wrong, because data.matrix() then silently converts the factor columns to their numeric level codes.

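    To see the coercion directly, here is a quick check (factor() is called explicitly so this behaves the same on R >= 4.0, where data.frame() no longer converts strings to factors by default):

    ex2 = data.frame(num1, cat1 = factor(cat1), cat2 = factor(cat2), x)
    head(data.matrix(ex2[, 1:3]), 3)  # the factors become their integer level codes

         num1 cat1 cat2
    [1,]    1    1    1
    [2,]    2    2    3
    [3,]    3    1    2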
    So we can do:

    ex = data.frame(num1, cat1, cat2, x)
    xxx = dummy_cols(ex, select_columns = c("cat1", "cat2"),
                     remove_first_dummy = TRUE, remove_selected_columns = TRUE)
    xxx$x = NULL  # drop the response so it does not end up in the predictor matrix
    

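    With the corrected xxx, a minimal refit would look like the sketch below. The parallel=TRUE from the original call is dropped here since no parallel backend was registered, and note that on this 10-row toy set cv.glmnet will warn that there are too few observations per fold for AUC, so the full 690-row dataset is where this matters:

    yyy = ex$x
    mod.lasso = cv.glmnet(as.matrix(xxx), yyy,
                          family = "binomial", alpha = 1,
                          standardize = TRUE, type.measure = "auc")
    plot(mod.lasso)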
    Regarding the AUC curve: you have very little data and only 15 variables, so the curve can be a bit shaky. You can think of it this way: as lambda grows (towards the right of the plot), fewer coefficients remain non-zero and the cross-validated estimate gets noisier. Try it again on your full dataset and see whether the curve changes.
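    One way to see this on the fitted object (lambda, nzero, cvm and cvsd are all documented fields of a cv.glmnet result):

    # non-zero coefficient count and CV error along the path: as lambda grows,
    # nzero shrinks and the standard error of the CV estimate tends to widen
    data.frame(lambda  = mod.lasso$lambda,
               nonzero = mod.lasso$nzero,
               cv_mean = mod.lasso$cvm,
               cv_se   = mod.lasso$cvsd)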

    Below I use an example dataset (the UCI adult data), and you can see it works well with the dummy variables:

    coln = c('age','workclass','fnlwgt','edu','edu_num','maritial','occ',
             'relationship','race','sex','capital-gain','capital-loss',
             'hours-per-week','country','label')

    # stringsAsFactors = TRUE so the character columns come in as factors
    # (the default changed in R 4.0, which would break the is.factor() check below)
    df = read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
                  col.names = coln, na.strings = " ?", stringsAsFactors = TRUE)
    df = df[complete.cases(df),]

    # names of the categorical predictors (everything except the label)
    sel = names(which(sapply(df[,-ncol(df)], is.factor)))

    set.seed(999)  # for a reproducible subsample
    idx = sample(nrow(df), 2000)

    # one-hot encode the categorical predictors and drop the original columns
    X = dummy_cols(df[,-ncol(df)], select_columns = sel,
                   remove_selected_columns = TRUE)[idx,]

    Y = as.numeric(df$label)[idx] - 1  # recode the factor label to 0/1

    fit = cv.glmnet(x = as.matrix(X), y = Y, family = "binomial", type.measure = "auc")
    plot(fit)
    
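    To read off the selected penalty and see which dummies survive, the standard cv.glmnet accessors can be used:

    fit$lambda.min                 # lambda with the best mean CV AUC
    fit$lambda.1se                 # largest lambda within one SE of the best
    coef(fit, s = "lambda.1se")    # sparse coefficients; dummies shrunk to zero print as "."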

    [plot: cross-validated AUC vs. log(lambda) for the adult-data fit]