Tags: python, scikit-learn, confusion-matrix

Why does scikit-learn's ConfusionMatrixDisplay shift the order of labels if a class has no predicted samples?


I have a model that can predict 10 classes. I ran it on 26 samples, and none of them belongs to class 3 (= 'jumping_jacks'), so neither my labels y_test nor my predictions y_pred contain this class. In this case I would expect the confusion matrix to show both the row "True label jumping_jacks" and the column "Predicted label jumping_jacks" full of zeros.

However, it does show predictions for class 3. Those predictions are actually the predictions for class 4 (= 'lateral_shoulder_raises'). So everything is shifted, starting from the row/column for class 3 up to the end. This is also why the matrix does not contain results for class 9 (= 'tricep_extensions'), although y_test and y_pred contain this class.

How can I fix this?

Reproducible Code:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.preprocessing import LabelEncoder

ex_classes = {'Classes': ['bicep_curls', 'dumbbell_rows', 'dumbbell_shoulder_press', 'jumping_jacks',
       'lateral_shoulder_raises', 'lunges', 'pushups', 'situps', 'squats',
       'tricep_extensions']}
df_classes = pd.DataFrame(data=ex_classes)
label_enc = LabelEncoder()
label_enc.fit(df_classes['Classes'])

y_test = np.asarray([8, 8, 8, 6, 6, 6, 2, 2, 2, 5, 5, 5, 1, 1, 1, 7, 7, 7, 9, 9, 9, 0, 0, 0, 0, 4])
y_pred = np.asarray([8, 4, 4, 6, 6, 6, 2, 2, 2, 5, 5, 5, 1, 1, 1, 9, 7, 7, 9, 9, 9, 0, 0, 1, 0, 4])
cm = confusion_matrix(y_test, y_pred)
display = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels = label_enc.classes_)
fig, ax = plt.subplots(figsize=(10,10))
display.plot(ax=ax, xticks_rotation='vertical')
plt.show()

My output: [image: confusion matrix plot]


Solution

  • You need to pass labels explicitly when computing the confusion matrix:

    cm = confusion_matrix(y_test, y_pred, labels=np.arange(len(df_classes)))
    

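    Neither the predictions nor the ground-truth labels contain label 3. When labels is not passed, confusion_matrix infers the label set from the data, so it builds a 9x9 matrix over the nine labels that actually occur. The 10 display labels are then applied by position, which shifts everything from class 3 onward. The relevant comment in the sklearn source: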

        # If labels are not consecutive integers starting from zero, then
        # y_true and y_pred must be converted into index form
    

    https://github.com/scikit-learn/scikit-learn/blob/21829b5ddb8f50292dd302fff5c9aad1c4b1998a/sklearn/metrics/_classification.py#L335

    Result with labels specified via confusion_matrix(..., labels=):

    [image: confusion matrix plot]

    Full example:

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
    from sklearn.preprocessing import LabelEncoder
    
    ex_classes = {
        "Classes": [
            "bicep_curls",
            "dumbbell_rows",
            "dumbbell_shoulder_press",
            "jumping_jacks",
            "lateral_shoulder_raises",
            "lunges",
            "pushups",
            "situps",
            "squats",
            "tricep_extensions",
        ]
    }
    df_classes = pd.DataFrame(data=ex_classes)
    label_enc = LabelEncoder()
    label_enc.fit(df_classes["Classes"])
    
    y_test = np.asarray(
        [8, 8, 8, 6, 6, 6, 2, 2, 2, 5, 5, 5, 1, 1, 1, 7, 7, 7, 9, 9, 9, 0, 0, 0, 0, 4]
    )
    y_pred = np.asarray(
        [8, 4, 4, 6, 6, 6, 2, 2, 2, 5, 5, 5, 1, 1, 1, 9, 7, 7, 9, 9, 9, 0, 0, 1, 0, 4]
    )
    cm = confusion_matrix(y_test, y_pred, labels=np.arange(len(df_classes)))
    display = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=label_enc.classes_)
    fig, ax = plt.subplots(figsize=(10, 10))
    display.plot(ax=ax, xticks_rotation="vertical")
    plt.show()
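
To check the explanation above without plotting, a minimal sketch like the following may help. It compares the label set that sklearn infers from the data with the matrix shapes obtained with and without labels=. The variable n_classes is an assumption introduced here for illustration (the number of classes the model knows about). unique_labels is the sklearn utility that returns the sorted labels present in the given arrays.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.utils.multiclass import unique_labels

y_test = np.asarray(
    [8, 8, 8, 6, 6, 6, 2, 2, 2, 5, 5, 5, 1, 1, 1, 7, 7, 7, 9, 9, 9, 0, 0, 0, 0, 4]
)
y_pred = np.asarray(
    [8, 4, 4, 6, 6, 6, 2, 2, 2, 5, 5, 5, 1, 1, 1, 9, 7, 7, 9, 9, 9, 0, 0, 1, 0, 4]
)
n_classes = 10  # assumption: total number of classes the model can predict

# Labels that actually occur in y_test or y_pred; class 3 is missing.
inferred = unique_labels(y_test, y_pred)

# Without `labels`, the matrix only covers the labels found in the data.
cm_inferred = confusion_matrix(y_test, y_pred)

# With `labels`, every class gets a row and a column, even if it never occurs.
cm_full = confusion_matrix(y_test, y_pred, labels=np.arange(n_classes))

print(inferred)            # [0 1 2 4 5 6 7 8 9]
print(cm_inferred.shape)   # (9, 9)
print(cm_full.shape)       # (10, 10)
print(cm_full[3].sum(), cm_full[:, 3].sum())  # 0 0
```

If the first shape does not match the number of display labels, the tick labels no longer line up with the rows and columns, which is the shift described in the question.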