I'm trying to implement k-nearest neighbors on the Iris dataset, but after making predictions, yhat comes out 100% correct with no errors. Something must be wrong, and I have no idea what it is...
I created a column named class_id by mapping each species name to a number (1.0, 2.0, 3.0); that column is type float.
x = df[['sepal length', 'sepal width', 'petal length', 'petal width']].values
type(x) shows numpy.ndarray
y = df['class_id'].values
type(y) shows numpy.ndarray
x = preprocessing.StandardScaler().fit(x).transform(x.astype(float))
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size = 0.2, random_state = 42)
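(Side note: the scaler above is fitted on all of x before the split. A common refinement is to fit it on the training fold only, so the test set cannot influence the scaling statistics. A minimal sketch of that variant, using sklearn's bundled load_iris as a stand-in for the DataFrame:)

```python
from sklearn.datasets import load_iris          # stand-in for the CSV / DataFrame
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

x, y = load_iris(return_X_y=True)

# Split first, then fit the scaler on the training fold only,
# so no information from the test set leaks into the scaling.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42)

scaler = StandardScaler().fit(x_train)   # statistics from training data only
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)
```

This doesn't change the result much on Iris, but it avoids leakage on datasets where it would matter.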
Ks = 12
for k in range(1, Ks):
    neigh = KNeighborsClassifier(n_neighbors=k).fit(x_train, y_train)
    yhat = neigh.predict(x_test)
    score = metrics.accuracy_score(y_test, yhat)
    print('K: ', k, ' score: ', score, '\n')
K: 1 score: 0.9666666666666667
K: 2 score: 1.0
K: 3 score: 1.0
K: 4 score: 1.0
K: 5 score: 1.0
K: 6 score: 1.0
K: 7 score: 1.0
K: 8 score: 1.0
K: 9 score: 1.0
K: 10 score: 1.0
K: 11 score: 1.0
print(yhat)
print(y_test)
yhat: [2. 1. 3. 2. 2. 1. 2. 3. 2. 2. 3. 1. 1. 1. 1. 2. 3. 2. 2. 3. 1. 3. 1. 3. 3. 3. 3. 3. 1. 1.]
y_test: [2. 1. 3. 2. 2. 1. 2. 3. 2. 2. 3. 1. 1. 1. 1. 2. 3. 2. 2. 3. 1. 3. 1. 3. 3. 3. 3. 3. 1. 1.]
They shouldn't all be 100% correct; something must be wrong.
I found the answer in this explanation by the user skillsmuggler:
You are making use of the iris dataset. It's a well-cleaned, model dataset. The features have a strong correlation with the result, which lets the kNN model fit the data really well. To test this, you can reduce the size of the training set, and this will result in a drop in accuracy.
The prediction model was correct all along.
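skillsmuggler's suggestion is easy to check: shrink the training set and see whether the score dips. A minimal sketch, assuming sklearn's bundled load_iris as a stand-in for my DataFrame (its labels are 0-2 rather than my 1.0-3.0 class_id):

```python
from sklearn import metrics
from sklearn.datasets import load_iris          # stand-in for the CSV / DataFrame
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

x, y = load_iris(return_X_y=True)               # labels are 0-2 here, not 1-3
x = StandardScaler().fit_transform(x)

def knn_accuracy(test_size):
    # Hold out `test_size` of the data and score a k=5 classifier on the rest.
    x_tr, x_te, y_tr, y_te = train_test_split(
        x, y, test_size=test_size, random_state=42)
    yhat = KNeighborsClassifier(n_neighbors=5).fit(x_tr, y_tr).predict(x_te)
    return metrics.accuracy_score(y_te, yhat)

big_acc = knn_accuracy(0.2)    # 120 training samples, as in the question
small_acc = knn_accuracy(0.9)  # only 15 training samples
print('80% train:', big_acc)
print('10% train:', small_acc)
```

With only 15 training points the score usually drops below perfect, which supports the conclusion that the 1.0 accuracy came from a clean, easily separable dataset rather than a bug.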