Dummies in Lasso Regression in R


I have a dataset of 690 observations with categorical and numerical variables. I want to perform Lasso regression, but when I plot the Lasso curve it is not smooth, and I would like to know whether the problem comes from the dummies or from something else. Here is a reproducible example dataset:

num1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
cat1 = c("a", "b", "a", "a", "b", "a", "b", "a", "a", "b")
cat2 = c("gg", "uu", "t", "t", "t", "uu", "uu", "gg", "t", "t") 
x = c(0, 0, 1, 1, 0, 0, 0, 1, 1, 0)
ex = data.frame(num1, cat1, cat2, x)

And here is the code:

library(fastDummies)
ex <- dummy_cols(ex, select_columns = c("cat1", "cat2"), remove_first_dummy = TRUE)


xxx <- ex[,1:3]
yyy <- ex$x
unique(yyy)

xxx <- data.matrix(xxx)

library(glmnet)
set.seed(999)
mod.lasso <- cv.glmnet(xxx, yyy, 
                         family='binomial', alpha=1, 
                         parallel=TRUE, standardize=TRUE, type.measure='auc')

Here you can see my plot:

[plot: cross-validated AUC curve from cv.glmnet, jagged rather than smooth]


Solution

  • If you look at the output:

    library(fastDummies)
    ex = data.frame(num1, cat1, cat2, x)
    ex <- dummy_cols(ex, select_columns = c("cat1", "cat2"), remove_first_dummy = TRUE)
    
    head(ex)
    
      num1 cat1 cat2 x cat1_b cat2_t cat2_uu
    1    1    a   gg 0      0      0       0
    2    2    b   uu 0      1      0       1
    3    3    a    t 1      0      1       0
    4    4    a    t 1      0      1       0
    5    5    b    t 0      1      1       0
    6    6    a   uu 0      0      0       1
    

    What you actually need are cat1_b, cat2_t and cat2_uu: your categorical columns converted to binary indicators. Taking the first three columns is wrong, because data.matrix() then silently converts the factor columns to their numeric level codes.

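    To see the coercion directly, here is a quick check (factor() is called explicitly so this behaves the same on R >= 4.0, where data.frame() no longer converts strings to factors by default):

    ex2 = data.frame(num1, cat1 = factor(cat1), cat2 = factor(cat2), x)
    head(data.matrix(ex2[, 1:3]), 3)  # the factors become their integer level codes

         num1 cat1 cat2
    [1,]    1    1    1
    [2,]    2    2    3
    [3,]    3    1    2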
    So we can do:

    ex = data.frame(num1, cat1, cat2, x)
    xxx = dummy_cols(ex, select_columns = c("cat1", "cat2"),
                     remove_first_dummy = TRUE, remove_selected_columns = TRUE)
    xxx$x = NULL  # drop the response so it does not end up in the predictor matrix
    

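    With the corrected xxx, a minimal refit would look like the sketch below. The parallel=TRUE from the original call is dropped here since no parallel backend was registered, and note that on this 10-row toy set cv.glmnet will warn that there are too few observations per fold for AUC, so the full 690-row dataset is where this matters:

    yyy = ex$x
    mod.lasso = cv.glmnet(as.matrix(xxx), yyy,
                          family = "binomial", alpha = 1,
                          standardize = TRUE, type.measure = "auc")
    plot(mod.lasso)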
    Regarding the AUC curve: you have very little data and only 15 variables, so the curve can be a bit shaky. You can think of it this way: as lambda grows (towards the right of the plot), fewer coefficients remain non-zero and the cross-validated estimate gets noisier. Try it again on your full dataset and see whether the curve changes.
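    One way to see this on the fitted object (lambda, nzero, cvm and cvsd are all documented fields of a cv.glmnet result):

    # non-zero coefficient count and CV error along the path: as lambda grows,
    # nzero shrinks and the standard error of the CV estimate tends to widen
    data.frame(lambda  = mod.lasso$lambda,
               nonzero = mod.lasso$nzero,
               cv_mean = mod.lasso$cvm,
               cv_se   = mod.lasso$cvsd)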

    Below I use an example dataset (the UCI adult data), and you can see it works well with the dummy variables:

    coln = c('age','workclass','fnlwgt','edu','edu_num','maritial','occ',
             'relationship','race','sex','capital-gain','capital-loss',
             'hours-per-week','country','label')

    # stringsAsFactors = TRUE so the character columns come in as factors
    # (the default changed in R 4.0, which would break the is.factor() check below)
    df = read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
                  col.names = coln, na.strings = " ?", stringsAsFactors = TRUE)
    df = df[complete.cases(df),]

    # names of the categorical predictors (everything except the label)
    sel = names(which(sapply(df[,-ncol(df)], is.factor)))

    set.seed(999)  # for a reproducible subsample
    idx = sample(nrow(df), 2000)

    # one-hot encode the categorical predictors and drop the original columns
    X = dummy_cols(df[,-ncol(df)], select_columns = sel,
                   remove_selected_columns = TRUE)[idx,]

    Y = as.numeric(df$label)[idx] - 1  # recode the factor label to 0/1

    fit = cv.glmnet(x = as.matrix(X), y = Y, family = "binomial", type.measure = "auc")
    plot(fit)
    
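    To read off the selected penalty and see which dummies survive, the standard cv.glmnet accessors can be used:

    fit$lambda.min                 # lambda with the best mean CV AUC
    fit$lambda.1se                 # largest lambda within one SE of the best
    coef(fit, s = "lambda.1se")    # sparse coefficients; dummies shrunk to zero print as "."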

    [plot: cross-validated AUC vs. log(lambda) for the adult-data fit]