Tags: python, classification, regularized

How to perform elastic-net for a classification problem?


I am a noob and I have previously tackled a linear regression problem using regularised methods. That was all pretty straightforward, but I now want to use elastic net on a classification problem.

I have run a baseline logistic regression model and the prediction scores are decent (accuracy and F1 score of ~80%). I know that some of my input features are highly correlated and I suspect that I am introducing multicollinearity, which is why I want to run an elastic net to see the impact on the coefficients and compare against the baseline.

I have done some googling and I understand I need to use the SGDClassifier for a regularised logistic regression model. Is this the best way to perform this analysis, and can anyone point me in the direction of a basic example with cross-validation?


Solution

  • Interesting question.

    Your question really should be broken up into multiple other questions, such as:

    • How can I tell if my data is collinear?
    • How to deal with collinear data in a machine learning problem?
    • How can I convert logistic regression to elasticnet for classification?

    I am going to focus on the third bullet above.

    Additionally, there is no sample data, or even a minimal, complete, reproducible example of code for us to go off of, so I am going to make some assumptions below.

    How can I use logistic regression for classification?

    What's the difference between logistic regression and elasticnet?

    First, let's understand what is different about logistic regression vs elasticnet. This TowardsDataScience article is fairly well written and goes into the details a little bit, and you should review it if you are unfamiliar. In short,

    Plain logistic regression does not penalize the model for its weight choices, while elastic net adds both an absolute-value (L1) and a squared (L2) penalty, with the mix between the two controlled by an l1_ratio coefficient.
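
    Written out (this is the standard formulation, not a quote from that article), the elastic-net-penalized logistic loss looks like:

    \min_{w}\; \sum_{i=1}^{n} \log\!\left(1 + e^{-y_i\, x_i^{\top} w}\right)
    \;+\; \lambda \left( \rho\,\lVert w \rVert_1 + \frac{1-\rho}{2}\,\lVert w \rVert_2^2 \right)

    Here lambda sets the overall penalty strength (sklearn exposes this as C, roughly 1/lambda) and rho is the l1_ratio: rho = 1 gives a pure L1 (lasso) penalty, rho = 0 a pure L2 (ridge) penalty, and anything in between is elastic net.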

    What does that difference look like in code?

    You can review the source code for Logistic Regression here, but in short, lines 794-796 show the alpha and beta values changing when the penalty type is elasticnet.
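
    In rough, paraphrased form, that mapping looks something like the sketch below. This is not copied from the source (and the exact line numbers shift between sklearn versions); it is just meant to show how C and l1_ratio get combined:

    def saga_penalty_terms(C, penalty, l1_ratio=None):
        """Sketch of how sklearn turns C / penalty / l1_ratio into the L2 strength
        (alpha) and L1 strength (beta) handed to the SAG/SAGA solver."""
        if penalty == "l1":
            alpha, beta = 0.0, 1.0 / C
        elif penalty == "l2":
            alpha, beta = 1.0 / C, 0.0
        elif penalty == "elasticnet":
            alpha = (1.0 / C) * (1 - l1_ratio)  # squared (L2) portion
            beta = (1.0 / C) * l1_ratio         # absolute-value (L1) portion
        else:
            raise ValueError("unsupported penalty: {}".format(penalty))
        return alpha, beta

    # With C=1 and l1_ratio=0.5 the penalty is split evenly between L1 and L2
    print(saga_penalty_terms(C=1.0, penalty="elasticnet", l1_ratio=0.5))  # (0.5, 0.5)

    The key point is that the elasticnet branch produces non-zero values for both terms, i.e. both penalties are applied at once.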

    What does this mean for an example?

    Below is an example of implementing this in code using sklearn's Logistic Regression. Some notes:

    • I am using cross validation as requested, and have set it to 3 folds
    • I would take this performance with a grain of salt -- there is a lot of feature engineering which should be done, and parameters such as the l1_ratios should absolutely be investigated. These values were totally arbitrary.

    This produces output that looks like:

    Logistic Regression: 0.972027972027972 || Elasticnet: 0.9090909090909091
    
    Logistic Regression
                  precision    recall  f1-score   support
    
               0       0.96      0.96      0.96        53
               1       0.98      0.98      0.98        90
    
        accuracy                           0.97       143
       macro avg       0.97      0.97      0.97       143
    weighted avg       0.97      0.97      0.97       143
    
    Elastic Net
                  precision    recall  f1-score   support
    
               0       0.93      0.81      0.87        53
               1       0.90      0.97      0.93        90
    
        accuracy                           0.91       143
       macro avg       0.92      0.89      0.90       143
    weighted avg       0.91      0.91      0.91       143
    

    Code below:

    # Load libraries
    
    # Load a toy dataset
    from sklearn.datasets import load_breast_cancer
    
    # Load the LogisticRegression classifier
    # Note, use CV for cross-validation as requested in the question
    from sklearn.linear_model import LogisticRegressionCV
    
    # Load some other sklearn functions
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report
    
    # Import other libraries
    import pandas as pd, numpy as np
    
    # Load the breast cancer dataset
    X, y = load_breast_cancer(return_X_y=True, as_frame=True)
    
    # Create your training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=2)
    
    # Basic LogisticRegression algorithm
    logistic_regression_classifier = LogisticRegressionCV(cv=3)
    # SAGA should be considered more advanced and used over SAG. For more information, see: https://stackoverflow.com/questions/38640109/logistic-regression-python-solvers-defintions
    # Note, you should probably tune this, these values are arbitrary
    elastic_net_classifier = LogisticRegressionCV(cv=3, penalty='elasticnet', l1_ratios=[0.1, 0.5, 0.9], solver='saga')
    
    # Train the models
    logistic_regression_classifier.fit(X_train, y_train)
    elastic_net_classifier.fit(X_train, y_train)
    
    # Test the models
    print("Logistic Regression: {} || Elasticnet: {}".format(logistic_regression_classifier.score(X_test, y_test), elastic_net_classifier.score(X_test, y_test)))
    
    # Print out some more metrics
    print("Logistic Regression")
    print(classification_report(y_test, logistic_regression_classifier.predict(X_test)))
    print("Elastic Net")
    print(classification_report(y_test, elastic_net_classifier.predict(X_test)))
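
    Since the original goal was to see how the penalty affects the coefficients, a quick follow-on sketch (run after the code above, which has already imported pandas and fitted both classifiers) is to line them up side by side:

    # Compare the fitted coefficients to see the shrinkage introduced by the
    # elastic-net penalty (coef_ has shape (1, n_features) for binary problems)
    coef_comparison = pd.DataFrame({
        "feature": X.columns,
        "logistic_regression": logistic_regression_classifier.coef_[0],
        "elastic_net": elastic_net_classifier.coef_[0],
    })
    print(coef_comparison.round(3))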
    

    Alternatively, there is another approach you could take, similar to how RidgeClassifierCV works, but you would need to write a bit of a wrapper around it yourself, as sklearn does not provide one.
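
    Since the question specifically mentions SGDClassifier, here is a minimal sketch of that route as well, with 3-fold cross-validation done through GridSearchCV. The parameter values are purely illustrative, feature scaling matters a lot more for SGD, and note that loss="log_loss" was called loss="log" in older sklearn versions:

    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import SGDClassifier
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=2)

    # SGD-based logistic regression with an elastic-net penalty; the scaler is
    # important because SGD is sensitive to feature scale
    pipeline = make_pipeline(
        StandardScaler(),
        SGDClassifier(loss="log_loss", penalty="elasticnet", max_iter=1000, random_state=2),
    )

    # 3-fold cross-validation over the regularisation strength and the L1/L2 mix
    param_grid = {
        "sgdclassifier__alpha": [1e-4, 1e-3, 1e-2],
        "sgdclassifier__l1_ratio": [0.1, 0.5, 0.9],
    }
    search = GridSearchCV(pipeline, param_grid, cv=3)
    search.fit(X_train, y_train)
    print(search.best_params_, search.score(X_test, y_test))

    Both routes fit the same kind of model; LogisticRegressionCV with solver='saga' is usually the more convenient of the two because it builds the regularisation path for you.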