pandas dataset knn training-data

What is the point of creating training and testing data in kNN?


I'm facing my first machine learning algorithm, which is kNN, and the thing that has confused me the most is splitting the dataset into training and testing data. With more complicated ML algorithms I can imagine that the computer needs a 'training' process, but kNN is more straightforward, so having a training set seems unnecessary. Either that, or I haven't fully understood kNN.

For background: I have a dataset and need to ask the user for some input. From there I can find the k nearest neighbors of the user's input.

I'd be very grateful for your explanation. Thank you in advance :).


Solution

  • kNN usually uses a validation dataset to find the optimal number of neighbors (k) to take into consideration.

    With k chosen that way, you use the test set to check how your algorithm performs "in the wild", i.e. on data it has not seen before.

    If you can somehow determine the optimal number of neighbors from the user's input, you don't need either a test or a validation set. If not (e.g. the user inputs some value, but you can't be certain from it what the preferable number of neighbors is), you should do both validation and testing (or some other variant, e.g. K-Fold cross-validation, to find the hyperparameters).

    EDIT: There are other hyperparameters, such as the distance metric, but the same idea holds.
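
To make the idea above concrete, here is a minimal sketch in plain Python (standard library only). Everything in it is an assumption for illustration: the data is synthetic, the 60/20/20 split and the candidate values of k are arbitrary, and the helper names `knn_predict` and `accuracy` are made up for this example. It shows that "training" in kNN is just storing the labelled points, while the validation set is what picks k and the test set is only touched once at the end.

```python
import random
from collections import Counter


def knn_predict(train, query, k):
    """Classify `query` by majority vote among its k nearest stored points."""
    # kNN has no fitting step: `train` is simply the stored labelled data.
    nearest = sorted(
        train,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], query)),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy(train, held_out, k):
    """Fraction of held-out points whose label is predicted correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in held_out)
    return hits / len(held_out)


# Synthetic, made-up data: two overlapping 2-D Gaussian blobs (classes 0 and 1).
rng = random.Random(0)
data = [((rng.gauss(0, 1), rng.gauss(0, 1)), 0) for _ in range(150)]
data += [((rng.gauss(2, 1), rng.gauss(2, 1)), 1) for _ in range(150)]
rng.shuffle(data)

# Assumed 60/20/20 split into training, validation and test sets.
train, val, test = data[:180], data[180:240], data[240:]

# Validation: try several odd k values (odd avoids ties with two classes).
candidate_ks = [1, 3, 5, 7, 9, 11, 15]
val_scores = {k: accuracy(train, val, k) for k in candidate_ks}
best_k = max(val_scores, key=val_scores.get)

# Test: one final check with the chosen k, on data never used to choose it.
test_accuracy = accuracy(train + val, test, best_k)

print("validation accuracy per k:", val_scores)
print("chosen k:", best_k)
print("test accuracy:", round(test_accuracy, 3))
```

If the test set had also been used to choose k, the final number would be an optimistic estimate, which is why the two sets are kept apart. In practice a library such as scikit-learn would likely replace the hand-written helpers, and the distance metric could be searched over in the same way as k.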