Tags: python, scikit-learn, logistic-regression, lasso-regression, regularized

Lasso regression won't remove 2 features which are highly correlated


I have two features, say F1 and F2, which have a correlation of about 0.9.

When I built my model, I first included all the features in my regression model. I then ran Lasso regression on it, hoping this would tackle any collinearity between the features. However, the Lasso regression kept both F1 and F2 in my model.

Two questions:

i) If F1 and F2 are highly correlated but Lasso regression still keeps both of them, what could this mean? Does it mean regularization doesn't work in some cases?

ii) How do I adjust my model or the Lasso regression so that it kicks F1 or F2 out of my model? (I am using sklearn.linear_model.LogisticRegression with penalty='l1' or 'elasticnet', have tried very large and very small C values, the 'liblinear' and 'saga' solvers, and l1_ratio=1, but I still can't kick either F1 or F2 out of my model.)
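For concreteness, these are the kinds of configurations I have been trying (the values shown are just examples; I have swept C over several orders of magnitude):

```python
from sklearn.linear_model import LogisticRegression

# Pure L1 penalty ("Lasso-style" logistic regression):
lasso_like = LogisticRegression(penalty='l1', solver='liblinear', C=0.01)

# Elastic net with l1_ratio=1, which should behave like a pure L1 penalty:
enet_like = LogisticRegression(penalty='elasticnet', solver='saga',
                               l1_ratio=1.0, C=0.01, max_iter=5000)
```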


Solution

  • Answers to your questions:

    i) Lasso shrinks coefficients gradually. You may find a nice picture in books by Robert Tibshirani, the person behind the Lasso, showing how individual coefficients gradually fall to zero as the regularization strength increases (you can perform such an experiment yourself; a small sketch of it follows these answers). The fact that the model still keeps both features can mean one of two things: either the model deems both important, or there is not enough regularization to kill one of them.

    ii) You're right to use Lasso, i.e. L1 regularization; the knob is the C parameter. The way it's coded in sklearn, the smaller the C, the stronger the regularization (C is the inverse of the regularization strength). That said, in machine learning your task is not to totally eliminate collinearity ("to kill F1 or F2", in your parlance) but to find a model (or a set of parameters, if you wish) that generalizes best. That is done through model tuning via cross-validation (see the CV sketch below). Warning: stronger regularization means more underfitting.

    I would add, though, that collinearity is somewhat dangerous for linear regression because it can cause model instability (different coefficients on different subsamples). So, with linear regression, you may wish to check this too (the last sketch below illustrates it on bootstrap subsamples).
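A minimal sketch of the experiment from (i): sweep C from weak to strong regularization and watch the coefficients shrink toward zero. The data here is a synthetic stand-in, with column 1 built to be highly correlated with column 0 (playing the roles of your F1 and F2):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: make column 1 nearly a copy of column 0.
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           random_state=0)
X[:, 1] = X[:, 0] + 0.5 * np.random.default_rng(0).normal(size=500)
X = StandardScaler().fit_transform(X)  # scale features so the L1 penalty treats them fairly

# Smaller C = stronger L1 penalty (C is the inverse of the regularization strength).
for C in [10, 1, 0.1, 0.01, 0.001]:
    clf = LogisticRegression(penalty='l1', solver='liblinear', C=C)
    clf.fit(X, y)
    print(f"C={C:<6} coefficients: {np.round(clf.coef_[0], 3)}")
```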
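And a sketch of tuning C by cross-validation, as suggested in (ii). LogisticRegressionCV keeps whichever C generalizes best under the L1 penalty; that will not necessarily be one that zeroes out F1 or F2. The data setup is the same stand-in as above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

# Same kind of stand-in data: column 1 nearly duplicates column 0.
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           random_state=0)
X[:, 1] = X[:, 0] + 0.5 * np.random.default_rng(0).normal(size=500)
X = StandardScaler().fit_transform(X)

# Search a grid of C values with 5-fold CV and keep the best-generalizing one.
clf = LogisticRegressionCV(Cs=20, penalty='l1', solver='saga',
                           cv=5, max_iter=5000)
clf.fit(X, y)
print("chosen C:", clf.C_[0])
print("coefficients:", clf.coef_[0])
```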
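Finally, a rough illustration of the instability point: refit the same model on bootstrap subsamples and watch how the weight gets shared between the two correlated columns (again on the synthetic stand-in; columns 0 and 1 play the roles of F1 and F2):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

# Same kind of stand-in data: column 1 nearly duplicates column 0.
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           random_state=0)
X[:, 1] = X[:, 0] + 0.5 * np.random.default_rng(0).normal(size=500)
X = StandardScaler().fit_transform(X)

# With correlated columns, the individual coefficients can move around a lot
# from one resample to the next, even if the predictions stay stable.
for i in range(5):
    X_bs, y_bs = resample(X, y, random_state=i)
    clf = LogisticRegression(penalty='l1', solver='liblinear', C=1.0)
    clf.fit(X_bs, y_bs)
    print(f"subsample {i}: coef(F1)={clf.coef_[0, 0]:+.3f}, "
          f"coef(F2)={clf.coef_[0, 1]:+.3f}")
```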