python machine-learning scikit-learn confusion-matrix

Incorrect labels in confusion matrix


I tried to create a confusion matrix for a KNN classifier in Python, but the class labels on the plot are wrong.

The class attribute of the dataset takes the values 2 (benign) and 4 (malignant), but when I plot the confusion matrix, all labels are 2. The data source and the code I use are below.

Data source: http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

KNN classifier on Breast Cancer Wisconsin (Diagnostic) Data Set from UCI:

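import numpy as np
import pandas as pd
from sklearn import neighbors
from sklearn.model_selection import train_test_split

data = pd.read_csv('/breast-cancer-wisconsin.data')
data.replace('?', 0, inplace=True)     # missing values are marked with '?'
data.drop('id', axis=1, inplace=True)  # the id column carries no signal

X = np.array(data.drop(' class ', axis=1))
Y = np.array(data[' class '])

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
clf = neighbors.KNeighborsClassifier()
clf.fit(X_train, Y_train)

accuracy = clf.score(X_test, Y_test)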

Plot the confusion matrix:

import matplotlib.pyplot as plt
from sklearn.metrics import plot_confusion_matrix

disp = plot_confusion_matrix(clf, X_test, Y_test,
                             display_labels=Y,
                             cmap=plt.cm.Blues)

[Image: the resulting confusion matrix]


Solution

  • The problem is that you're passing Y as the display_labels argument, which should contain only the target names used for plotting, one per class. Because Y holds one label per sample, the plot just takes the first two values that appear in Y, which happen to be 2 and 2. Note too that, as mentioned in the docs, display_labels defaults to labels when labels is provided, so you just need:

    from sklearn.metrics import plot_confusion_matrix
    fig, ax = plt.subplots(figsize=(8, 8))
    # np.unique(Y) gives the sorted class values [2, 4], one entry per class
    disp = plot_confusion_matrix(clf, X_test, Y_test,
                                 labels=np.unique(Y),
                                 cmap=plt.cm.Blues, ax=ax)
    
