Tags: svm, text-classification, roc, multiclass-classification, tfidfvectorizer

ValueError: multiclass format is not supported on ROC_Curve for text classification


I am trying to use a ROC curve to evaluate my emotion text classifier.

This is my code for the ROC curve:

# ROC-AUC Curve
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
fpr, tpr, thresholds = roc_curve(y_test, y_test_hat2)
roc_auc = auc(fpr, tpr)
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=1, label='ROC curve (area = %0.2f)' % roc_auc)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC CURVE')
plt.legend(loc="lower right")
plt.show()

This is the error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-30-ef4ee0eff994> in <module>()
      2 from sklearn.metrics import roc_curve, auc
      3 import matplotlib.pyplot as plt
----> 4 fpr, tpr, thresholds = roc_curve(y_test, y_test_hat2)
      5 roc_auc = auc(fpr, tpr)
      6 plt.figure()

1 frames
/usr/local/lib/python3.7/dist-packages/sklearn/metrics/_ranking.py in roc_curve(y_true, y_score, pos_label, sample_weight, drop_intermediate)
    961     """
    962     fps, tps, thresholds = _binary_clf_curve(
--> 963         y_true, y_score, pos_label=pos_label, sample_weight=sample_weight
    964     )
    965 

/usr/local/lib/python3.7/dist-packages/sklearn/metrics/_ranking.py in _binary_clf_curve(y_true, y_score, pos_label, sample_weight)
    729     y_type = type_of_target(y_true)
    730     if not (y_type == "binary" or (y_type == "multiclass" and pos_label is not None)):
--> 731         raise ValueError("{0} format is not supported".format(y_type))
    732 
    733     check_consistent_length(y_true, y_score, sample_weight)

ValueError: multiclass format is not supported

This is how y_test and y_test_hat2 are defined:

y_test = data_test["label"]


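from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC  # import missing from the original snippet

# Note: the fit steps for the vectorizer and the classifier are not shown here.
vectorizer = TfidfVectorizer()
test_vectors = vectorizer.transform(data_test['tweet'])
classifier_linear2 = LinearSVC(verbose=1)
# predict() returns hard class labels, not scores
y_test_hat2 = classifier_linear2.predict(test_vectors)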

Shape of test_vectors = (1096, 11330)

Shape of y_test = (1096,)

Labels in y_test = ['0', '1', '2', '3', '4']


Solution

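  • A ROC curve is based on soft predictions, i.e. it uses the predicted probability (or score) of an instance belonging to the positive class rather than the predicted class. For example, with sklearn one can obtain the probabilities with predict_proba instead of predict, for the classifiers which provide it. LinearSVC does not provide predict_proba, but its decision_function returns per-class scores that can be used in the same way.

    Note: OP used the tag multiclass-classification, but it's important to note that a single ROC curve is defined for binary classification problems only. With five classes, one common workaround is to treat each class as "this class vs. the rest" and draw one curve per class.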

    One can find a short explanation of ROC curves here.
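
To make the two points above concrete, here is a minimal sketch of a one-vs-rest ROC computation with LinearSVC. It is not the OP's pipeline: the data comes from make_classification as a synthetic stand-in for the TF-IDF vectors and the five emotion labels, and the names scores, y_test_bin, curves and roc_auc are made up for illustration. The idea is simply to replace predict with decision_function and to binarize the labels so that roc_curve sees one binary problem per class.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.svm import LinearSVC

# Synthetic stand-in (assumption) for the TF-IDF features and 5 emotion labels.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=10,
    n_classes=5, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

clf = LinearSVC(max_iter=10000).fit(X_train, y_train)

# Soft scores instead of hard labels: one column per class.
scores = clf.decision_function(X_test)

# One 0/1 indicator column per class, in the same order as clf.classes_.
y_test_bin = label_binarize(y_test, classes=clf.classes_)

curves = {}   # class label -> (fpr, tpr)
roc_auc = {}  # class label -> area under that class's curve
for i, cls in enumerate(clf.classes_):
    # Binary problem: "is this sample of class cls or not?"
    fpr, tpr, _ = roc_curve(y_test_bin[:, i], scores[:, i])
    curves[cls] = (fpr, tpr)
    roc_auc[cls] = auc(fpr, tpr)

for cls in clf.classes_:
    print(f"class {cls}: one-vs-rest AUC = {roc_auc[cls]:.2f}")
```

Each (fpr, tpr) pair in curves can then be passed to plt.plot exactly as in the question's plotting code, giving five curves on one figure. With string labels such as '0' to '4', the same approach should work as long as classes=clf.classes_ is used for the binarization, so that the indicator columns line up with the columns of decision_function.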