I am analyzing the Secom dataset from the UCI Machine Learning repository by lasso-regularized logistic regression, but the results are bad.
https://archive.ics.uci.edu/ml/datasets/SECOM
Characteristics:
The goal is to predict the positive class accurately, and also to perform feature selection.
I optimize lambda by 10 fold cross validation with the glmnet package in R. But the results are terrible, as the model tends to assign all test samples to only a single class.
Is it simply the wrong kind of model for this dataset?
Predicting with imbalanced classes can be a very difficult problem to solve. Thankfully, there is a huge bibliography on how to solve such problems. There is a particular one that worked really well for me. It involves using up-sampling and/or down-sampling techniques:
down-sampling: randomly subset all the classes in the training set so that their class frequencies match the least prevalent class. For example, suppose that 80% of the training set samples are the first class and the remaining 20% are in the second class. Down-sampling would randomly sample the first class to be the same size as the second class (so that only 40% of the total training set is used to fit the model). caret contains a function (downSample) to do this.
up-sampling: randomly sample (with replacement) the minority class to be the same size as the majority class. caret contains a function (upSample) to do this.
hybrid methods: techniques such as SMOTE and ROSE down-sample the majority class and synthesize new data points in the minority class. There are two packages (DMwR and ROSE) that implement these procedures.
I took the above bullet points from this caret's documentation. The post contains examples about each one of the above bullet points and R code. You should be able to use a Lasso logistic regression and have better results after you have transformed your data based on the above techniques.