machine-learning, scikit-learn, probability, nearest-neighbor

Probability prediction method of KNeighborsClassifier returns only 0 and 1


Can anyone tell me what's wrong with my code? Why can I predict probabilities for the iris dataset with LogisticRegression, while KNeighborsClassifier gives me only 0 or 1? I expected it to give results like the ones LogisticRegression yields.

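from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn import metrics

iris = load_iris()
X = iris.data
y = iris.target

# skf was not defined in the original post; 10 stratified folds is an
# assumption (it matches the 15 test samples shown in the output below).
skf = StratifiedKFold(n_splits=10)

# After the loop, the train/test variables hold the split from the last fold.
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]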

from sklearn.linear_model import LogisticRegression
ln = LogisticRegression()
ln.fit(X_train,y_train)

ln.predict_proba(X_test)[:,1]

array([ 0.18075722, 0.08906078, 0.14693156, 0.10467766, 0.14823032, 0.70361962, 0.65733216, 0.77864636, 0.67203114, 0.68655163, 0.25219798, 0.3863194 , 0.30735105, 0.13963637, 0.28017798])

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5, algorithm='ball_tree', metric='euclidean')
knn.fit(X_train, y_train)

knn.predict_proba(X_test)[0:10,1]

array([ 0., 0., 0., 0., 0., 1., 1., 1., 1., 1.])


Solution

  • KNN has only a very limited notion of probability: its estimate is simply the fraction of votes for each class among the k nearest neighbours. With n_neighbors=5 and the default uniform weights, the only possible values are 0, 0.2, 0.4, 0.6, 0.8 and 1. Increase the number of neighbours to 15 or 100, or query points near the decision boundary, and you will see more diverse results. Right now all 5 neighbours of each of your test points share the same label, hence the probabilities of exactly 0 or 1.
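
To make the "fraction of votes" point concrete, here is a minimal sketch that recomputes predict_proba by hand from the neighbour labels and prints the distinct values it takes for a few values of k. The helper vote_fractions, the train_test_split settings and the chosen k values are illustrative assumptions, not part of the original question.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Illustrative split (an assumption; the question used stratified k-fold instead).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)


def vote_fractions(knn, X_query, y_train, n_classes):
    """Hypothetical helper: count neighbour labels per class and divide by k."""
    _, idx = knn.kneighbors(X_query)      # indices of the k nearest training points
    neighbour_labels = y_train[idx]       # shape (n_queries, k)
    counts = np.stack(
        [(neighbour_labels == c).sum(axis=1) for c in range(n_classes)], axis=1
    )
    return counts / knn.n_neighbors


for k in (5, 15, 50):
    # Default weights='uniform', so every neighbour casts one equal vote.
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    proba = knn.predict_proba(X_test)
    manual = vote_fractions(knn, X_test, y_train, n_classes=3)

    # The library's probabilities are exactly the vote fractions.
    print(k, "matches manual count:", np.allclose(proba, manual))
    # Distinct values for class 1; all are multiples of 1/k.
    print(k, "distinct P(class 1):", np.unique(proba[:, 1]))
```

With a small k the probabilities can only fall on a coarse grid of multiples of 1/k, and on a well-separated dataset like iris most test points land on 0 or 1. Passing weights='distance' changes the votes to be distance-weighted, which is another way to get less coarse values, though the estimate is still not a calibrated probability.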