python scikit-learn data-science knn iris-dataset

KNN prediction scoring 100% on y_test


I'm trying to implement K-nearest neighbors on the Iris dataset, but after making the predictions, yhat comes out 100% correct with no errors. There must be something wrong, and I have no idea what it is...

I created a column named class_id, where I mapped the class names to numbers:

  • setosa = 1.0
  • versicolor = 2.0
  • virginica = 3.0

That column is of type float.
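The mapping above can be sketched as follows. This is only an illustration: the source column name `class` and the four sample rows are assumptions, since the post does not show how `class_id` was actually built.

```python
import pandas as pd

# Hypothetical stand-in for the real DataFrame; the column name "class"
# is an assumption, the post only shows the resulting class_id column.
df = pd.DataFrame({"class": ["setosa", "versicolor", "virginica", "setosa"]})

# Same mapping as in the list above.
class_to_id = {"setosa": 1.0, "versicolor": 2.0, "virginica": 3.0}

# Series.map looks up each class name in the dict; the result is a float column.
df["class_id"] = df["class"].map(class_to_id)

print(df)
print(df["class_id"].dtype)
```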

Getting x and y


    x = df[['sepal length', 'sepal width', 'petal length', 'petal width']].values

type(x) shows numpy.ndarray


    y = df['class_id'].values

type(y) shows numpy.ndarray

Normalizing data


    x = preprocessing.StandardScaler().fit(x).transform(x.astype(float))

Creating train and test sets


    x_train, x_test, y_train, y_test = train_test_split(x,y,test_size = 0.2, random_state = 42)
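A side note on the order of the two steps above: the scaler is fitted on all rows before the split, so the test rows contribute to the mean and standard deviation. On Iris this is unlikely to change the scores much, but a common alternative is to fit the scaler on the training rows only. A minimal sketch, assuming the data comes from `sklearn.datasets.load_iris` rather than the `df` used in the post:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: load the features and labels directly from scikit-learn.
X, y = load_iris(return_X_y=True)

# Split first, with the same parameters as in the post.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training rows only, then apply it to both splits.
scaler = StandardScaler().fit(x_train)
x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)

print(x_train_scaled.shape, x_test_scaled.shape)
```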

Checking for the best K value:


    Ks = 12
    for k in range(1, Ks):
        # Fit a classifier for each k and score it on the held-out test set
        neigh = KNeighborsClassifier(n_neighbors=k).fit(x_train, y_train)
        yhat = neigh.predict(x_test)
        score = metrics.accuracy_score(y_test, yhat)
        print('K: ', k, ' score: ', score, '\n')

Result:

K: 1 score: 0.9666666666666667

K: 2 score: 1.0

K: 3 score: 1.0

K: 4 score: 1.0

K: 5 score: 1.0

K: 6 score: 1.0

K: 7 score: 1.0

K: 8 score: 1.0

K: 9 score: 1.0

K: 10 score: 1.0

K: 11 score: 1.0
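These scores all come from a single 30-row test set, so one lucky split can produce a run of 1.0 values. One way to get a less split-dependent picture is k-fold cross-validation. The sketch below is an illustration, not part of the original code: it assumes the data is loaded with `load_iris` and wraps the scaler and classifier in a pipeline so that scaling is refitted inside each fold.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: features and labels come straight from scikit-learn.
X, y = load_iris(return_X_y=True)

cv_means = {}
for k in range(1, 12):
    # The pipeline refits the scaler on the training part of every fold.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    # 5-fold cross-validation; each row is used for testing exactly once.
    fold_scores = cross_val_score(model, X, y, cv=5)
    cv_means[k] = fold_scores.mean()
    print(f"K: {k:2d}  mean accuracy: {cv_means[k]:.3f}")
```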

Printing yhat and y_test with K = 5


    print(yhat)
    print(y_test)

Result:

yhat: [2. 1. 3. 2. 2. 1. 2. 3. 2. 2. 3. 1. 1. 1. 1. 2. 3. 2. 2. 3. 1. 3. 1. 3. 3. 3. 3. 3. 1. 1.]

y_test: [2. 1. 3. 2. 2. 1. 2. 3. 2. 2. 3. 1. 1. 1. 1. 2. 3. 2. 2. 3. 1. 3. 1. 3. 3. 3. 3. 3. 1. 1.]

Surely they shouldn't all be 100% correct; there must be something wrong.


Solution

  • I found the answer thanks to the explanation from user skillsmuggler:

    You are making use of the Iris dataset. It is a clean, well-behaved textbook dataset. The features correlate strongly with the class label, so the kNN model fits the data really well. To test this, you can reduce the size of the training set, which should lead to a drop in accuracy.

    The prediction model was correct.
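The suggested check can be sketched as below: train on progressively fewer rows and watch how the test accuracy responds. This is a minimal sketch under some assumptions not found in the post: the data is loaded with `load_iris`, K is fixed at 5, and the split is stratified so that very small training sets still contain all three classes.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: features and labels come straight from scikit-learn.
X, y = load_iris(return_X_y=True)

scores = {}
for test_size in (0.2, 0.5, 0.8, 0.9):
    # A larger test_size leaves fewer rows for training.
    # stratify=y is an addition here; the original split did not use it.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=42, stratify=y
    )
    # Scale and classify in one pipeline, with K fixed at 5 for comparison.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
    model.fit(X_train, y_train)
    scores[test_size] = accuracy_score(y_test, model.predict(X_test))
    print(f"train rows: {len(X_train):3d}  accuracy: {scores[test_size]:.3f}")
```

The exact numbers depend on the split and on K, so a single run is only indicative; repeating it with several `random_state` values gives a fairer picture.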