Tags: scikit-learn, knn

Why is KNN algorithm in scikit not working as expected?


I am building a simple KNN model in Python using scikit-learn. I tested it on the wine dataset from UCI, and I noticed that the results returned by the .predict() method are not the majority class of the neighbors.

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=3, weights='uniform')

knn.fit(wine, class_wine)
predictions = list(knn.predict(wine))
# S is an array I've built that picks the majority class among the neighbors of each instance
a = list(zip(predictions, list(S)))

for i in range(len(wine)):
    if predictions[i] != S[i]:
        print(predictions[i], S[i], class_wine[knn.kneighbors()[1][i].tolist()].tolist())

Output looks like this:

1.0 3.0 [3.0, 2.0, 3.0]
1.0 2.0 [1.0, 2.0, 2.0]
1.0 2.0 [1.0, 2.0, 2.0]
1.0 3.0 [3.0, 1.0, 3.0]
3.0 2.0 [2.0, 3.0, 2.0]

The first column is the prediction from the scikit-learn algorithm; the second is from my own algorithm, which calls kneighbors() and picks the majority class from the returned neighbors, as predict() is supposed to do. The third column is the list of the neighbors' classes.

As you can see, predict() from KNeighborsClassifier is doing something differently.

Is there something about the implementation of KNeighborsClassifier that I am missing?


Solution

  • When you call knn.kneighbors() without the X parameter, it uses the training data stored on the fitted model (the stuff in self) and excludes each point from its own set of candidate neighbors. knn.predict(wine), on the other hand, cannot exclude the query point: it has no way of knowing the query is the same point as a training sample (it could be some other wine with exactly the same features), so each training point ends up as its own nearest neighbor. To reproduce predict()'s behavior, use knn.kneighbors(wine) instead when building your own predictor.
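
    The difference can be seen directly. Below is a minimal sketch with a made-up toy 1-D dataset (not the asker's wine data): kneighbors() with no argument excludes each training point from its own neighbor list, while kneighbors(X) — which matches what predict(X) sees — does not.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    # Toy data: six 1-D points in two loose clusters
    X = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]])
    y = np.array([0, 0, 1, 1, 1, 0])

    knn = KNeighborsClassifier(n_neighbors=3, weights='uniform')
    knn.fit(X, y)

    # Neighbors as seen by predict(X): the query point itself is a candidate,
    # so each training point's nearest neighbor is itself (distance 0).
    _, idx_predict = knn.kneighbors(X)

    # Neighbors from kneighbors() with no argument: self is excluded.
    _, idx_loo = knn.kneighbors()

    print(idx_predict[0])  # contains index 0 itself
    print(idx_loo[0])      # index 0 is excluded

    So if you compute your majority vote from kneighbors(wine) rather than kneighbors(), it should agree with predict(wine).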