I am a noob and I have previously tackled a linear regression problem using regularised methods. That was all fairly straightforward, but now I want to use elastic net on a classification problem.
I have run a baseline logistic regression model and the prediction scores are decent (accuracy and F1 score of ~80%). I know that some of my input features are highly correlated and I suspect that I am introducing multicollinearity, which is why I want to run an elastic net to see the impact on the coefficients and compare against the baseline.
I have done some googling and I understand I need to use the SGDClassifier for a regularised logistic regression model. Is this the best way to perform this analysis, and can anyone point me in the direction of a basic example with cross-validation?
Interesting question.
Your question really bundles several sub-questions together. I am going to focus on just one of them: how to fit an elastic-net logistic regression with cross-validation.
Additionally, there is no sample data, or even a minimal, complete, reproducible example of code for us to go off of, so I am going to make some assumptions below.
First, let's understand what is different about logistic regression vs. elastic net. This TowardsDataScience article is fairly well written and goes into the details a little bit; you should review it if you are unfamiliar. In short: a plain (unpenalized) logistic regression does not penalize the model for its weight choices, while elastic net adds both an absolute-value (L1) and a squared (L2) penalty, mixed together by an l1_ratio coefficient.
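To make that concrete, here is a small sketch of the mixed penalty on a hypothetical weight vector w (the function name and example values are mine; C is the inverse regularization strength, following sklearn's convention):

```python
import numpy as np

def elastic_net_penalty(w, C=1.0, l1_ratio=0.5):
    """Sketch of the elastic-net penalty on a weight vector w.

    l1_ratio=1.0 -> pure L1 (lasso-style), l1_ratio=0.0 -> pure L2 (ridge-style).
    C is the inverse regularization strength, as in sklearn's LogisticRegression.
    """
    l1 = np.abs(w).sum()       # absolute-value penalty
    l2 = 0.5 * np.dot(w, w)    # squared penalty
    return (1.0 / C) * (l1_ratio * l1 + (1.0 - l1_ratio) * l2)

w = np.array([1.0, -2.0, 0.5])
# l1 = 3.5, l2 = 0.5 * (1 + 4 + 0.25) = 2.625
print(elastic_net_penalty(w, C=1.0, l1_ratio=0.5))  # 0.5*3.5 + 0.5*2.625 = 3.0625
```

The L1 term is what can drive correlated features' coefficients all the way to zero, which is why elastic net is interesting for your multicollinearity concern.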
You can review the source code for LogisticRegression here; in short, lines 794-796 show the alpha and beta values changing when the penalty type is elasticnet.
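In plain Python, the mapping those lines perform looks roughly like this (a sketch of sklearn's internal logic, not a public API; the function name is mine):

```python
def penalty_to_alpha_beta(penalty, C, l1_ratio=None):
    """Sketch of how sklearn maps the penalty setting to the internal
    alpha (L2 strength) and beta (L1 strength) used by the solver path."""
    if penalty == "l1":
        alpha, beta = 0.0, 1.0 / C
    elif penalty == "l2":
        alpha, beta = 1.0 / C, 0.0
    else:  # elasticnet: split 1/C between L2 and L1 according to l1_ratio
        alpha = (1.0 / C) * (1.0 - l1_ratio)
        beta = (1.0 / C) * l1_ratio
    return alpha, beta

print(penalty_to_alpha_beta("elasticnet", C=2.0, l1_ratio=0.25))  # (0.375, 0.125)
```

So elasticnet is not a third, separate penalty: it just distributes the total regularization budget 1/C between the L2 and L1 terms.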
Below is an example of implementing this in code using sklearn's LogisticRegressionCV. One note: the l1_ratios values should absolutely be investigated further; the values below are totally arbitrary. It produces output that looks like:
Logistic Regression: 0.972027972027972 || Elasticnet: 0.9090909090909091
Logistic Regression
              precision    recall  f1-score   support

           0       0.96      0.96      0.96        53
           1       0.98      0.98      0.98        90

    accuracy                           0.97       143
   macro avg       0.97      0.97      0.97       143
weighted avg       0.97      0.97      0.97       143

Elastic Net
              precision    recall  f1-score   support

           0       0.93      0.81      0.87        53
           1       0.90      0.97      0.93        90

    accuracy                           0.91       143
   macro avg       0.92      0.89      0.90       143
weighted avg       0.91      0.91      0.91       143
Code below:
# Load libraries
# Load a toy dataset
from sklearn.datasets import load_breast_cancer
# Load the LogisticRegression classifier
# Note, use CV for cross-validation as requested in the question
from sklearn.linear_model import LogisticRegressionCV
# Load some other sklearn functions
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
# Import other libraries (pandas must be installed for as_frame=True below)
import pandas as pd
import numpy as np
# Load the breast cancer dataset
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
# Create your training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=2)
# Basic LogisticRegression algorithm. max_iter is raised because the default
# (100) triggers convergence warnings on this unscaled dataset
logistic_regression_classifier = LogisticRegressionCV(cv=3, max_iter=5000)
# SAGA should be considered more advanced and used over SAG. For more information, see: https://stackoverflow.com/questions/38640109/logistic-regression-python-solvers-defintions
# Note, you should probably tune l1_ratios; these values are arbitrary. SAGA also
# converges slowly on unscaled features, so consider standardizing them first
elastic_net_classifier = LogisticRegressionCV(cv=3, penalty='elasticnet', l1_ratios=[0.1, 0.5, 0.9], solver='saga', max_iter=5000)
# Train the models
logistic_regression_classifier.fit(X_train, y_train)
elastic_net_classifier.fit(X_train, y_train)
# Test the models
print("Logistic Regression: {} || Elasticnet: {}".format(logistic_regression_classifier.score(X_test, y_test), elastic_net_classifier.score(X_test, y_test)))
# Print out some more metrics
print("Logistic Regression")
print(classification_report(y_test, logistic_regression_classifier.predict(X_test)))
print("Elastic Net")
print(classification_report(y_test, elastic_net_classifier.predict(X_test)))
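Since your stated motivation was multicollinearity, you will probably also want to compare the fitted coefficients directly, not just the scores. Here is a minimal sketch of that; I have added standardization in a pipeline (my choice, not part of the code above) because penalized coefficients are only comparable when the features share a common scale, and SAGA converges much faster on scaled data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
import pandas as pd

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Standardize first: coefficients are only comparable on a common feature scale
lr = make_pipeline(StandardScaler(), LogisticRegressionCV(cv=3, max_iter=5000))
enet = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(cv=3, penalty="elasticnet", l1_ratios=[0.5],
                         solver="saga", max_iter=5000),
)
lr.fit(X, y)
enet.fit(X, y)

# Side-by-side coefficients; the [-1] step is the classifier inside each pipeline
coefs = pd.DataFrame(
    {
        "logistic": lr[-1].coef_.ravel(),
        "elasticnet": enet[-1].coef_.ravel(),
    },
    index=X.columns,
)
# Elastic net's L1 component can zero out some correlated features entirely
print(coefs.sort_values("elasticnet", key=abs, ascending=False).head(10))
print("zeroed coefficients:", (coefs["elasticnet"] == 0).sum())
```

Watching which coefficients shrink toward (or hit) zero under the elastic net, relative to the plain model, is exactly the comparison you described wanting to make against your baseline.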
Alternatively, there is another method you could use, similar to how RidgeClassifierCV functions, but you would need to write a bit of a wrapper around it, as sklearn has not provided an elastic-net equivalent out of the box.