I have a dataset of 690 observations with categorical and numerical variables. I want to perform Lasso regression, but when I plot the Lasso curve it is not smooth, and I would like to know whether the problem lies with the dummy variables or with something else. Here is a reproducible example dataset:
num1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
cat1 = c("a", "b", "a", "a", "b", "a", "b", "a", "a", "b")
cat2 = c("gg", "uu", "t", "t", "t", "uu", "uu", "gg", "t", "t")
x=c(0, 0, 1, 1, 0, 0, 0, 1, 1, 0)
ex = data.frame(num1, cat1, cat2, x)
And here is the code:
library(fastDummies)
ex <- dummy_cols(ex, select_columns = c("cat1", "cat2"), remove_first_dummy = TRUE)
xxx <- ex[,1:3]
yyy <- ex$x
unique(yyy)
xxx <- data.matrix(xxx)
library(glmnet)
set.seed(999)
mod.lasso <- cv.glmnet(xxx, yyy,
family='binomial', alpha=1,
parallel=TRUE, standardize=TRUE, type.measure='auc')
Here you can see my plot:
If you look at the output:
library(fastDummies)
ex = data.frame(num1, cat1, cat2, x)
ex <- dummy_cols(ex, select_columns = c("cat1", "cat2"), remove_first_dummy = TRUE)
head(ex)
  num1 cat1 cat2 x cat1_b cat2_t cat2_uu
1    1    a   gg 0      0      0       0
2    2    b   uu 0      1      0       1
3    3    a    t 1      0      1       0
4    4    a    t 1      0      1       0
5    5    b    t 0      1      1       0
6    6    a   uu 0      0      0       1
What you actually need are cat1_b, cat2_t and cat2_uu, i.e. your categorical columns converted to binary. Taking the first three columns is wrong: ex[,1:3] selects num1, cat1 and cat2, and data.matrix() then converts the categorical columns to arbitrary integer codes instead of dummies.
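To see why passing the raw categorical columns through data.matrix() is a problem, here is a minimal sketch in base R. The data frame name `demo` is only for illustration, and `stringsAsFactors = TRUE` is set explicitly so the behaviour is the same across R versions:

```r
# Small copy of the example data, with the categorical columns stored as factors
demo <- data.frame(
  num1 = 1:10,
  cat1 = c("a", "b", "a", "a", "b", "a", "b", "a", "a", "b"),
  cat2 = c("gg", "uu", "t", "t", "t", "uu", "uu", "gg", "t", "t"),
  stringsAsFactors = TRUE
)

# data.matrix() replaces each factor by its internal integer code
m <- data.matrix(demo)
print(levels(demo$cat2))  # levels are sorted alphabetically: "gg" "t" "uu"
print(m[1:3, ])           # cat2 is now 1, 3, 2 for gg, uu, t
```

The model would then treat cat2 as a single numeric predictor in which "uu" is three times "gg", which is not what the categories mean.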
So we can do:
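ex = data.frame(num1, cat1, cat2, x)
# Encode only the predictors; x is the response and must stay out of the design matrix
xxx = dummy_cols(ex[, c("num1", "cat1", "cat2")], select_columns = c("cat1", "cat2"), remove_first_dummy = TRUE, remove_selected_columns = TRUE)
yyy = ex$x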
As for the AUC curve: you have very little data and only 15 variables, so it might be a bit shaky. You can think of it this way: once lambda gets high (towards the right), you have fewer non-zero coefficients and the estimate gets shaky. You can try it again on your full dataset and see whether it changes.
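One way to check this on your own fit is to put the cross-validated AUC next to the number of non-zero coefficients at each lambda. The sketch below uses simulated data (the sizes `n`, `p` and the true coefficients are assumptions made up for illustration, not your dataset) and the `lambda`, `nzero`, `cvm` and `cvsd` components of the cv.glmnet result:

```r
library(glmnet)

set.seed(1)
n <- 200   # assumed sample size, for illustration only
p <- 15    # assumed number of predictors
X <- matrix(rnorm(n * p), n, p)
# Only the first two predictors carry signal in this simulation
y <- rbinom(n, 1, plogis(X[, 1] - X[, 2]))

cvfit <- cv.glmnet(X, y, family = "binomial", alpha = 1, type.measure = "auc")

# One row per lambda: model size, mean cross-validated AUC and its standard error
path <- data.frame(
  log_lambda = log(cvfit$lambda),
  nzero      = as.integer(cvfit$nzero),
  auc        = cvfit$cvm,
  se         = cvfit$cvsd
)
print(head(path))  # largest lambdas, fewest non-zero coefficients
print(tail(path))  # smallest lambdas, most non-zero coefficients
```

If the jumps in the curve line up with changes in `nzero`, or the `se` column is large relative to the differences in `auc`, the roughness is more likely noise from a small sample than a problem with the dummy coding.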
Below I use an example dataset, and you can see it works pretty well with the dummy variables:
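library(fastDummies)
library(glmnet)
coln=c('age','workclass','fnlwgt','edu','edu_num','maritial','occ','relationship','race','sex','capital-gain','capital-loss','hours-per-week','country','label')
# adult.data has no header row; stringsAsFactors=TRUE is needed on R >= 4.0 so that is.factor() below finds the categorical columns
df = read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",col.names=coln,na.strings = " ?",header=FALSE,stringsAsFactors=TRUE)
df = df[complete.cases(df),]
# Names of the categorical predictors (the label column is excluded)
sel = names(which(sapply(df[,-ncol(df)],is.factor)))
# Random subsample of 2000 rows; the seed makes it reproducible
set.seed(999)
idx = sample(nrow(df),2000)
X = dummy_cols(df[,-ncol(df)],select_columns=sel,
remove_selected_columns =TRUE)[idx,]
# label is a two-level factor; map its codes 1/2 to 0/1
Y =as.numeric(df$label)[idx]-1
fit = cv.glmnet(x=as.matrix(X),y=Y,family="binomial",type.measure="auc")
plot(fit)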