Tags: python, scikit-learn, knn, kdtree

Did I just find a bug in sklearn KNN classifier or does everything work as intended?


I have been toying with the Python sklearn k-nearest-neighbours classifier, and I believe it does not work right: results with k above 1 are wrong. I've tried to visualise how the different k-NN methods vary using my sample code.

The code is a bit long but not very complicated. Go ahead and run it yourself to get the pictures. I generate sample 2D data in the form of columns of around 10 points each. Most of the code is about plotting it nicely on the graph in an animated fashion. All classification happens after constructing the library object KNeighborsClassifier in the for loop in "main".

Suspecting it was a kd-tree issue, I tried the different algorithm options, but I got the same results (swap algorithm="kd_tree" for "ball_tree" or "brute").
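Trimmed down, that check looked roughly like this (a few made-up points rather than the generated columns below, just to show the swap):

# Same fit/predict with each neighbour-search algorithm swapped in.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

toy_points = np.array([[0.0, 0.0], [0.0, -5.0], [0.0, -10.0], [4.0, 0.0], [4.0, -5.0]])
toy_labels = [True, True, False, True, False]

for algorithm in ("kd_tree", "ball_tree", "brute"):
    clf = KNeighborsClassifier(n_neighbors=3, weights="uniform", algorithm=algorithm)
    clf.fit(toy_points, toy_labels)
    print(algorithm, clf.predict([[2.0, -7.0]]))  # same prediction for all three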

Here is the illustration of results I got:

[Image: result of the classifier with k=3 and uniform weights, kd_tree]

Picture comment: As you can see, in the 3rd column around x=2 all the area around the red points should be red, for example, and around x=-4 the area should be blue, because the next nearest red points are in a neighbouring column. I believe this is not how the classifier should behave, and I'm not sure whether I'm doing something wrong or it's an error in the library's methods. I've tried to review its code but decided to ask the question in the meantime. Also, I'm not familiar with the Cython it's written in.

Sources and version: I made the code using the scikit-learn documentation and matplotlib examples. I'm running Python 3.6.1 and sklearn version 0.18.1.

Bonus question: Is the k-neighbours answer computed with kd-trees approximate or exact? From my understanding it can easily work perfectly for k=1, but I'm not sure whether the answer is always correct with k above 1.
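A small side check along these lines (random made-up data, separate from the full script below) would be to query a KDTree directly and compare it with brute-force distances:

# Compare the kd-tree's k=3 neighbours against a plain brute-force search.
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.RandomState(0)
X = rng.uniform(-10, 10, size=(200, 2))
queries = rng.uniform(-10, 10, size=(5, 2))

tree = KDTree(X)
_, tree_idx = tree.query(queries, k=3)

# brute force: all pairwise distances, keep the 3 smallest per query point
dists = np.linalg.norm(queries[:, None, :] - X[None, :, :], axis=2)
brute_idx = np.argsort(dists, axis=1)[:, :3]

# True here would mean the tree query matched brute force exactly
print(np.array_equal(np.sort(tree_idx, axis=1), np.sort(brute_idx, axis=1)))

Anyway, here is the full code: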

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import neighbors
import random


random.seed(905) # 905
# interesting seed 2293
def generate_points(sizex, sizey):
    # sizex = 4
    # sizey = 10
    apart = 5
    # generating at which X coordinate my data column will be
    columns_x = [random.normalvariate(0, 5) for i in range(sizex)]
    columns_y = list()
    # randomising for each column the Y coordinate at which it starts
    for i in range(sizex):
        y_column = [random.normalvariate(-50, 100) for j in range(sizey)]
        y_column.sort()
        columns_y.append(y_column)

    # preparing lists of datapoints with classification
    datapoints = np.ndarray((sizex * sizey, 2))
    dataclass = list()

    # generating a random split for each column
    for i in range(sizex):
        division = random.randint(0, sizey)
        for j in range(sizey):
            datapoints[i * sizey + j][0] = columns_x[i]
            datapoints[i * sizey + j][1] = -j * apart
            dataclass.append(j < division)

    return datapoints, dataclass


if __name__ == "__main__":
    datapoints, dataclass = generate_points(4, 10)

    #### VISUALISATION PART ####
    x_min, y_min = np.argmin(datapoints, axis=0)
    x_min, y_min = datapoints[x_min][0], datapoints[y_min][1]
    x_max, y_max = np.argmax(datapoints, axis=0)
    x_max, y_max = datapoints[x_max][0], datapoints[y_max][1]
    x_range = x_max - x_min
    y_range = y_max - y_min
    x_min -= 0.15*x_range
    x_max += 0.15*x_range
    y_min -= 0.15*y_range
    y_max += 0.15*y_range

    mesh_step_size = .1

    # Create color maps
    cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF']) # for meshgrid
    cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF']) # for points

    plt.ion() # plot interactive mode
    for weights in ['uniform', 'distance']: # two weighting schemes
        for k in range(1, 13, 2): # few k choices
            # we create an instance of Neighbours Classifier and fit the data.
            clf = neighbors.KNeighborsClassifier(k, weights=weights, algorithm="kd_tree")
            clf.fit(datapoints, dataclass)

            # Plot the decision boundary. For that, we will assign a color to each
            # point in the mesh [x_min, x_max]x[y_min, y_max].
            xx, yy = np.meshgrid(np.arange(x_min, x_max, mesh_step_size),
                                 np.arange(y_min, y_max, mesh_step_size))
            Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

            # Put the result into a color plot
            Z = Z.reshape(xx.shape)

            plt.figure(1)
            plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

            # Plot also the training points
            plt.scatter(datapoints[:, 0], datapoints[:, 1], c=dataclass, cmap=cmap_bold, marker='.')
            plt.xlim(xx.min(), xx.max())
            plt.ylim(yy.min(), yy.max())
            plt.title("K-NN classifier (k = %i, weights = '%s')"
                      % (k, weights))

            plt.draw()
            input("Press Enter to continue...")
            plt.clf()

Also, I've decided to set the seed before posting so we all get the same result; feel free to change the random seed.


Solution

  • Your output seems to be fine.

    Something that might not be obvious from your graph is that the horizontal distances between points are actually shorter than the vertical distances. Even the furthest horizontal separation between two adjacent columns is 4.something, while the vertical separation between any two adjacent rows is 5.

    For the points classified as red, a majority of their 3 nearest neighbors in the training set really are red. It doesn't matter whether they're super-close to a blue point if the next two neighbors are red. Same for the points classified as blue near red points.
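
    A toy reproduction of the geometry makes this concrete (made-up coordinates, not your generated data: rows 5 apart vertically, the adjacent column only 4 apart horizontally). A query point sitting exactly on a red training point still gets out-voted, because its 2nd and 3rd nearest neighbours are blue:

    # Rows are 5 apart within a column, the neighbouring column is only 4 away.
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    X = np.array([
        [2.0, -20.0],   # red point at the bottom of its column
        [2.0, -15.0],   # blue point directly above it (5 away)
        [-2.0, -20.0],  # blue point in the adjacent column (4 away)
        [-2.0, -15.0],  # another blue point in that column
    ])
    y = ["red", "blue", "blue", "blue"]

    clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)

    query = np.array([[2.0, -20.0]])   # sits exactly on the red point
    dist, idx = clf.kneighbors(query)
    print([(y[i], round(d, 1)) for i, d in zip(idx[0], dist[0])])
    # [('red', 0.0), ('blue', 4.0), ('blue', 5.0)] -- two of the three are blue
    print(clf.predict(query))          # ['blue'], even though it sits on a red point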