I tried to find the top 3 values in a list while implementing my kNN model. I first did it the way that felt intuitive to me; the code was roughly as follows:

```python
first_k = X_train['distance'].sort_values().head(k)
prediction = first_k.value_counts().idxmax()
```

`first_k` contains the first k elements of the sorted `distance` column, and `prediction` is what the model finally returns.

Another approach I found on the internet was this:

```python
prediction = y_train[X_train["distance"].nsmallest(n=k).index].mode()[0]
```

The second approach yields the correct results, while my approach did not work as intended. Can someone explain why my approach did not work?

The difference is the use of `.index` after the `nsmallest(n=k)` call in the alternative approach. What your code does is the following:
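
1. Sort the dataset using `distance` as the sorting key, then take the first k elements of the sorted dataset.
2. Take the most frequent value among those k distances.

The alternative approach instead does the following steps:

1. Find the k smallest values in the `distance` column.
2. Recover the index of those rows; with `k=5` it could be an object that, when printed, shows something similar to `Int64Index([3, 9, 10, 1, 8], dtype='int64')`.
3. Select from `y_train` the labels with the same index values as the ones recovered in the previous step.
4. Take the most frequent value of those labels (the `mode`).

So, as you can see, the main difference is that the most frequent distance is not necessarily the most frequent class among the k neighbours that you have recovered. In fact, your version returns a distance value rather than a class label.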
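
To make the difference concrete, here is a minimal sketch on toy data. The values, the index labels 10 to 15, and the variable names `wrong_prediction` and `right_prediction` are made up for illustration; only the two pandas expressions come from the question.

```python
import pandas as pd

# Hypothetical toy data: X_train and y_train share the same row labels.
X_train = pd.DataFrame(
    {"distance": [0.9, 0.2, 0.5, 0.1, 0.7, 0.3]},
    index=[10, 11, 12, 13, 14, 15],
)
y_train = pd.Series(["a", "a", "a", "b", "a", "b"], index=X_train.index)
k = 3

# Original approach: counts how often each *distance* occurs among the k smallest.
first_k = X_train["distance"].sort_values().head(k)
wrong_prediction = first_k.value_counts().idxmax()  # a distance, not a class label

# Alternative approach: use the row labels of the k nearest points to look up classes.
nearest_index = X_train["distance"].nsmallest(n=k).index
right_prediction = y_train[nearest_index].mode()[0]

print(wrong_prediction)   # one of the three smallest distances
print(right_prediction)   # the majority class among the 3 nearest rows
```

Here the three nearest rows are 13, 11 and 15, with labels `b`, `a` and `b`, so the label-based version returns `b`, while the first version can only ever return a number from the `distance` column.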
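
Anyway, your code can be easily fixed:

```python
# Keep the row labels of the k smallest distances, not the distances themselves.
first_k = X_train['distance'].sort_values().head(k).index
# Look up the labels of those rows and take the most frequent one.
prediction = y_train[first_k].mode()[0]
```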
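
If it helps, the fixed version can be wrapped in a small helper and checked against the `nsmallest` variant. This is only a sketch: `predict_label` and the toy data are hypothetical names and values, and it assumes the distances and labels share the same index and that the distances are distinct.

```python
import pandas as pd

def predict_label(distances: pd.Series, labels: pd.Series, k: int):
    """Return the most common label among the k rows with the smallest distance."""
    nearest_index = distances.sort_values().head(k).index
    return labels.loc[nearest_index].mode()[0]

# Hypothetical toy data with a shared index.
distances = pd.Series([0.9, 0.2, 0.5, 0.1, 0.7, 0.3], index=[10, 11, 12, 13, 14, 15])
labels = pd.Series(["a", "a", "a", "b", "a", "b"], index=distances.index)

prediction = predict_label(distances, labels, k=3)
# Same computation written with nsmallest, as in the alternative approach.
same_as_nsmallest = labels.loc[distances.nsmallest(n=3).index].mode()[0]
print(prediction, same_as_nsmallest)
```

Note that `mode()` returns its results sorted, so when two classes are tied among the k neighbours, `[0]` picks the first one in sort order; you may want a different tie-breaking rule in a real model.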