Tags: machine-learning, classification, knn, confusion-matrix

kNN Consistently Overusing One Label


I am using a kNN to do some classification of labeled images. After my classification is done, I output a confusion matrix. I noticed that one label, bottle, was being applied incorrectly more often than the others. [confusion matrix image]

I removed the label and tested again, but then noticed that another label, shoe, was being applied incorrectly, though it had been fine last time. [confusion matrix image]

There should be no normalization, so I'm unsure what is causing this behavior. Testing showed the problem continued no matter how many labels I removed. I'm not totally sure how much code to post, so I'll include what should be relevant and pastebin the rest.

def confusionMatrix(classifier, train_DS_X, train_DS_y, test_DS_X, test_DS_y):
    # Will output a confusion matrix graph for the prediction
    y_pred = classifier.fit(train_DS_X, train_DS_y).predict(test_DS_X)
    labels = set(train_DS_y) | set(test_DS_y)  # NB: a set has no guaranteed ordering

    def plot_confusion_matrix(cm, title='Confusion matrix', cmap=plt.cm.Blues):
        plt.imshow(cm, interpolation='nearest', cmap=cmap)
        plt.title(title)
        plt.colorbar()
        tick_marks = np.arange(len(labels))
        plt.xticks(tick_marks, labels, rotation=45)
        plt.yticks(tick_marks, labels)
        plt.tight_layout()
        plt.ylabel('True label')
        plt.xlabel('Predicted label')

    # Compute confusion matrix
    cm = confusion_matrix(test_DS_y, y_pred)
    np.set_printoptions(precision=2)
    print('Confusion matrix, without normalization')
    #print(cm)
    plt.figure()
    plot_confusion_matrix(cm)

    # Normalize the confusion matrix by row (i.e by the number of samples
    # in each class)
    cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    print('Normalized confusion matrix')
    #print(cm_normalized)
    plt.figure()
    plot_confusion_matrix(cm_normalized, title='Normalized confusion matrix')

    plt.show()
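As a side note, the row-normalization step in the function above can be sanity-checked on a toy matrix (the numbers here are made up for illustration):

```python
import numpy as np

# Toy 3x3 confusion matrix: rows = true labels, columns = predictions.
cm = np.array([[8, 1, 1],
               [2, 6, 2],
               [0, 5, 5]])

# Divide each row by its total so every row sums to 1.0, turning raw
# counts into per-class proportions (the diagonal becomes per-class recall).
row_sums = cm.sum(axis=1)[:, np.newaxis]   # shape (3, 1), broadcasts across columns
cm_normalized = cm.astype('float') / row_sums

print(cm_normalized)
```

Each row of `cm_normalized` sums to 1.0, which is what "normalized by the number of samples in each class" means here.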

Relevant Code from Main Function:

    # Select training and test data
    PCA = decomposition.PCA(n_components=.95)
    zscorer = ZScoreMapper(param_est=('targets', ['rest']), auto_train=False)
    DS = getVoxels (1, .5)
    train_DS = DS[0]
    test_DS = DS[1]

    # Apply PCA and ZScoring
    train_DS = processVoxels(train_DS, True, zscorer, PCA)
    test_DS = processVoxels(test_DS, False, zscorer, PCA)
    print(3 * "\n")

    # Select the desired features
    # If selecting samples or PCA, that must be the only feature
    featuresOfInterest = ['pca']
    trainDSFeat = selectFeatures(train_DS, featuresOfInterest)
    testDSFeat = selectFeatures(test_DS, featuresOfInterest)
    train_DS_X = trainDSFeat[0]
    train_DS_y = trainDSFeat[1]
    test_DS_X = testDSFeat[0]
    test_DS_y = testDSFeat[1]


    # Optimization of neighbors
    # Naively searches for local max starting at numNeighbors
    lastScore = 0
    lastNeighbors = 1
    score = .0000001
    numNeighbors = 5
    while score > lastScore:
        lastScore = score
        lastNeighbors = numNeighbors
        numNeighbors += 1
        #Classification
        neigh = neighbors.KNeighborsClassifier(n_neighbors=numNeighbors, weights='distance')
        neigh.fit(train_DS_X, train_DS_y)

        #Testing
        score = neigh.score(test_DS_X, test_DS_y)

    # Confusion Matrix Output
    neigh = neighbors.KNeighborsClassifier(n_neighbors=lastNeighbors, weights='distance')
    confusionMatrix(neigh, train_DS_X, train_DS_y, test_DS_X, test_DS_y)
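The neighbor search above is a naive hill climb: keep incrementing k while the test-set accuracy improves, then fall back to the last k that improved. A self-contained sketch of the same loop, using synthetic data from `make_classification` as a stand-in for the voxel features (the dataset and split here are assumptions, not the original data):

```python
import numpy as np
from sklearn import neighbors
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data standing in for the PCA-reduced voxel features.
X, y = make_classification(n_samples=400, n_features=10,
                           n_informative=6, random_state=0)
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.5,
                                                    random_state=0)

# Naive local search: increment k while the score keeps improving.
lastScore = 0.0
score = 1e-7
numNeighbors = 5
lastNeighbors = 5
while score > lastScore:
    lastScore = score
    lastNeighbors = numNeighbors
    numNeighbors += 1
    clf = neighbors.KNeighborsClassifier(n_neighbors=numNeighbors,
                                         weights='distance')
    score = clf.fit(train_X, train_y).score(test_X, test_y)

print(lastNeighbors, lastScore)
```

One quirk worth noting: because `numNeighbors` is incremented before the first fit, k=5 itself is never actually evaluated, and the search stops at the first local maximum rather than the global best k.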

Pastebin: http://pastebin.com/U7yTs3vs


Solution

  • The issue was partly the result of my axes being mislabeled: when I thought I was removing the faulty label, I was actually just removing a random label, so the faulty data was still being analyzed. Fixing the axes and removing the faulty label, which was actually rest, yielded: [confusion matrix image]

    The code I changed is: cm = confusion_matrix(test_DS_y, y_pred, labels=labels)

    Basically, I manually set the ordering based on my list of ordered labels, so the rows and columns of the matrix line up with the tick labels on the plot.
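The mismatch is easy to reproduce: without the `labels` argument, scikit-learn's `confusion_matrix` orders rows and columns by `sorted(unique labels)`, which need not match the iteration order of the Python set used for the tick marks. A minimal demonstration (the label values here are made up to mirror the question):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = ['shoe', 'shoe', 'bottle', 'rest', 'rest', 'rest']
y_pred = ['shoe', 'bottle', 'bottle', 'rest', 'shoe', 'rest']

# Default: rows/columns are ordered by sorted unique labels,
# i.e. ['bottle', 'rest', 'shoe'].
cm_default = confusion_matrix(y_true, y_pred)

# Passing an explicit, ordered label list pins the axis order, so the
# tick labels drawn from the same list are guaranteed to line up.
ordered_labels = ['rest', 'bottle', 'shoe']
cm_ordered = confusion_matrix(y_true, y_pred, labels=ordered_labels)

print(cm_default)
print(cm_ordered)
```

If the plot's tick labels come from an unordered set while the matrix uses sorted order, every row and column can end up captioned with the wrong class name, which is exactly the symptom in the question.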