Tags: python, machine-learning, scikit-learn, nlp, auc

How do I reshape data to calculate ROC and AUC for binary text classification?


I'm very new to Python and need to calculate the ROC and AUC of two binary classification models using NLP data. I can't get my head around sparse vs. dense arrays (I understand that sparse arrays are mostly zeros and dense arrays are not), data shape, and dimensionality.

I think I can produce pretty good preprocessed data, but inputting that into my classifiers in a way they can read has me stymied.

In my code below, you'll note that I have tried more than one train test split. I get

TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.

if I don't convert x and y to dense.

I get

ValueError: y should be a 1d array, got an array of shape (1594, 286579) instead

UndefinedMetricWarning: No positive samples in y_true, true positive value should be meaningless

when I do the dense conversion.

And I get

ValueError: Found input variables with inconsistent numbers of samples: [1594, 399]

when (if I'm remembering correctly) using the commented out train test split.

Here is my messy, redundant code:

import joblib
import re
import string
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score, classification_report
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.naive_bayes import MultinomialNB

categories = ['rec.sport.baseball', 'rec.sport.hockey']

news_group_data = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"), categories=categories)

df = pd.DataFrame(dict(text=news_group_data["data"],target=news_group_data["target"]))
df["target"] = df.target.map(lambda x: categories[x])

def process_text(text):
    text = str(text).lower()
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
    text = " ".join(text.split())
    return text

df["clean_text"] = df.text.map(process_text)

#df_train, df_test = train_test_split(df, test_size=0.20, stratify=df.target)

vec = CountVectorizer(ngram_range=(1, 3), stop_words="english",)

x = vec.fit_transform(df.clean_text)
y = vec.transform(df.clean_text)


#X = vec.fit_transform(df_train.clean_text)
#Y = vec.transform(df_test.clean_text)

X = x.toarray()
Y = y.toarray()

#y_train = df_train.target
#y_test = df_test.target

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.2,
                                                    random_state=0)

from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features=5,
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, #min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=500,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

nb = GaussianNB()
nb.fit(X_train, Y_train)

r_probs = [0 for _ in range(len(Y_test))]
rf_probs = rf.predict_proba(X_test)
nb_probs = nb.predict_proba(X_test)

rf_probs = rf_probs[:, 1]
nb_probs = nb_probs[:, 1]

from sklearn.metrics import roc_curve, roc_auc_score

r_auc = roc_auc_score(Y_test, r_probs)
rf_auc = roc_auc_score(Y_test, rf_probs)
nb_auc = roc_auc_score(Y_test, nb_probs)

print('Random (chance) Prediction: AUROC = %.3f' % (r_auc))
print('Random Forest: AUROC = %.3f' % (rf_auc))
print('Naive Bayes: AUROC = %.3f' % (nb_auc))

r_fpr, r_tpr, _ = roc_curve(Y_test, r_probs)
rf_fpr, rf_tpr, _ = roc_curve(Y_test, rf_probs)
nb_fpr, nb_tpr, _ = roc_curve(Y_test, nb_probs)

import matplotlib.pyplot as plt

plt.plot(r_fpr, r_tpr, linestyle='--', label='Random prediction (AUROC = %0.3f)' % r_auc)
plt.plot(rf_fpr, rf_tpr, marker='.', label='Random Forest (AUROC = %0.3f)' % rf_auc)
plt.plot(nb_fpr, nb_tpr, marker='.', label='Naive Bayes (AUROC = %0.3f)' % nb_auc)

plt.title('ROC Plot')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()

Solution

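  • The problem is that you are not using the correct target. You are vectorizing the same text twice with the CountVectorizer, so y becomes a second copy of the feature matrix rather than the class labels. That is why the classifier complains that y has shape (1594, 286579). The lines responsible are: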

    x = vec.fit_transform(df.clean_text)
    y = vec.transform(df.clean_text)
    

    Instead, encode the binary class in df.target as the target for the model (your Y):

    def labeling(v):
        if v == categories[0]:
            return 0
        else:
            return 1
    
    df["target_encod"] = df.target.map(labeling)
    print(df['target_encod'])
    

    After that, you can use the correct y for your machine learning problem:

    X = x.toarray()
    Y = df["target_encod"].values
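
As a rough check on the shapes involved, here is a minimal sketch on a hypothetical toy corpus (the texts and labels below are made up and only stand in for df.clean_text and df["target_encod"]). It assumes nothing beyond scikit-learn and NumPy. The features form a 2-D matrix with one row per document, the target is a 1-D array with one label per document, and the dense conversion is only applied for the model that requires it:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB

# Hypothetical toy data standing in for df.clean_text and df["target_encod"].
texts = [
    "pitcher threw a strike",
    "home run in the ninth inning",
    "shortstop caught the fly ball",
    "batter hit a double",
    "goalie made a glove save",
    "slap shot from the blue line",
    "power play goal in overtime",
    "defenseman cleared the puck",
]
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # one label per document

X = CountVectorizer().fit_transform(texts)  # SciPy sparse matrix
print(X.shape, y.shape)  # (n_samples, n_features) and (n_samples,)

# train_test_split accepts the sparse matrix as-is and keeps the rows
# of X aligned with the entries of y.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# MultinomialNB works on sparse counts directly.
mnb = MultinomialNB().fit(X_train, y_train)

# GaussianNB needs dense input, so convert only for that model.
gnb = GaussianNB().fit(X_train.toarray(), y_train)
```

On the real data, keeping the matrix sparse where the model allows it avoids materializing a dense array with hundreds of thousands of columns.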
    

    My result after the changes: [image: AUROC plot]

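    As for the second problem, you never assigned the RandomForestClassifier instance to a variable, so rf is undefined by the time rf.predict_proba(X_test) runs. (The question's code also never calls rf.fit(X_train, Y_train), which is needed before predict_proba.) You wrote: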

    RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                           criterion='gini', max_depth=None, max_features=5,
                           max_leaf_nodes=None, max_samples=None,
                           min_impurity_decrease=0.0, #min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=500,
                           n_jobs=None, oob_score=False, random_state=None,
                           verbose=0, warm_start=False)
    

    instead of

    rf = RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                                criterion='gini', max_depth=None, max_features=5,
                                max_leaf_nodes=None, max_samples=None,
                                min_impurity_decrease=0.0, #min_impurity_split=None,
                                min_samples_leaf=1, min_samples_split=2,
                                min_weight_fraction_leaf=0.0, n_estimators=500,
                                n_jobs=None, oob_score=False, random_state=None,
                                verbose=0, warm_start=False)
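
Putting both fixes together, a minimal end-to-end sketch might look like the following. It is not the original poster's data: the toy texts and labels are hypothetical stand-ins for the two newsgroups, and n_estimators=100 is an arbitrary choice to keep the example fast. The point is only the order of operations: assign the forest to a variable, fit both models on the 1-D labels, then take the positive-class column of predict_proba for the ROC and AUC calls.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy corpus: 8 "baseball" documents (0) and 8 "hockey" documents (1).
texts = [
    "pitcher threw a strike in the ninth inning",
    "home run over the left field wall",
    "shortstop caught the fly ball",
    "batter hit a double to center field",
    "the bullpen gave up three runs",
    "catcher threw out the runner at second base",
    "pitcher walked the batter with bases loaded",
    "outfield fly ball ended the inning",
    "goalie made a glove save on the slap shot",
    "power play goal in overtime",
    "defenseman cleared the puck from the crease",
    "hat trick for the winger on the ice",
    "penalty box for high sticking",
    "puck hit the post on the breakaway",
    "goalie stopped the puck in the shootout",
    "slap shot goal from the blue line",
]
y = np.array([0] * 8 + [1] * 8)  # 1-D target, one label per document

# Dense features here because GaussianNB does not accept sparse input.
X = CountVectorizer().fit_transform(texts).toarray()

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Assign the estimator to a name AND fit it before predicting.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

nb = GaussianNB()
nb.fit(X_train, y_train)

# Column 1 holds the predicted probability of the positive class (label 1).
rf_probs = rf.predict_proba(X_test)[:, 1]
nb_probs = nb.predict_proba(X_test)[:, 1]

rf_auc = roc_auc_score(y_test, rf_probs)
nb_auc = roc_auc_score(y_test, nb_probs)
rf_fpr, rf_tpr, _ = roc_curve(y_test, rf_probs)
nb_fpr, nb_tpr, _ = roc_curve(y_test, nb_probs)

print("Random Forest: AUROC = %.3f" % rf_auc)
print("Naive Bayes: AUROC = %.3f" % nb_auc)
```

The stratify=y argument keeps both classes in the test split, which avoids the "No positive samples in y_true" warning that appears when y_test contains only one class.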