python pandas numpy scikit-learn pca

Calculating AUC for LogisticRegression model


Let's take the following data:

import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn import metrics

data = load_breast_cancer()
X = data.data
y = data.target

I want to create a model using only the first principal component and calculate the AUC for it.

My work so far:

scaler = StandardScaler()
scaler.fit(X)  # fit on X; X_train is never defined in this snippet
X_scaled = scaler.transform(X)
pca = PCA(n_components=1)
principalComponents = pca.fit_transform(X_scaled)  # run PCA on the scaled data
principalDf = pd.DataFrame(data=principalComponents,
                           columns=['principal component 1'])
clf = LogisticRegression()
clf = clf.fit(principalDf, y)
pred = clf.predict_proba(principalDf)

But when I try to use

fpr, tpr, thresholds = metrics.roc_curve(y, pred, pos_label=2)

the following error occurs:

y should be a 1d array, got an array of shape (569, 2) instead.

I tried to reshape my data:

fpr, tpr, thresholds = metrics.roc_curve(y.reshape(1,-1), pred, pos_label=2)

But that didn't solve the issue. It outputs:

multilabel-indicator format is not supported

Do you have any idea how I can calculate the AUC for this first principal component?


Solution

  • You may wish to try:

    from sklearn.datasets import load_breast_cancer
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler
    from sklearn import metrics
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    scaler = StandardScaler()
    pca = PCA(n_components=1)  # keep only the first principal component, as the question asks
    clf = LogisticRegression()
    ppl = Pipeline([("scaler", scaler), ("pca", pca), ("clf", clf)])
    ppl.fit(X_train, y_train)
    # roc_curve needs a 1-D array of scores, so take the class-1 probability column
    preds = ppl.predict_proba(X_test)[:, 1]
    
    fpr, tpr, thresholds = metrics.roc_curve(y_test, preds, pos_label=1)
    # plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay replaces it
    metrics.RocCurveDisplay.from_estimator(ppl, X_test, y_test)
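
    As for why the call in the question fails: predict_proba returns one column per class, so for this binary problem its output has shape (n_samples, 2), while roc_curve expects a 1-D array of scores. Reshaping y does not help because the scores are the 2-D argument. Also, the labels in this dataset are 0 and 1, so pos_label=2 does not match any class. Below is a minimal sketch of getting the AUC number itself on held-out data. The stratify and random_state settings and the variable names are my own illustrative choices, not something from the question.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# stratify and random_state are illustrative assumptions, used for a reproducible split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Scale, keep only the first principal component, then fit the classifier
model = make_pipeline(StandardScaler(), PCA(n_components=1), LogisticRegression())
model.fit(X_train, y_train)

proba = model.predict_proba(X_test)  # shape (n_samples, 2): one column per class
scores = proba[:, 1]                 # 1-D probabilities for class 1

# Two equivalent routes to the same number
fpr, tpr, thresholds = roc_curve(y_test, scores, pos_label=1)
auc_from_curve = auc(fpr, tpr)
auc_direct = roc_auc_score(y_test, scores)

print(f"AUC on the test set: {auc_direct:.3f}")
```

    If you only need the number and not the curve, roc_auc_score alone should be enough. Fitting the scaler and PCA inside the pipeline keeps them from seeing the test rows, which the approach in the question (fitting on all of X) does not guarantee.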
    
