python scikit-learn knn

Classification of unevenly sampled data using KNN


I have a dataset with 360 samples for class 0 and only 44 samples for class 1. When I fit a KNN model to the data using k=3, the model misclassifies many class 1 samples as class 0. What is the best way to deal with such unevenly sampled data? I could set k=1, but from what I have read, that makes the predictions very sensitive to noise.


Solution

  • Check out this discussion on CrossValidated, especially the third answer. One approach mentioned there is to weight neighbors "by the inverse of their class size", so that each neighbor's vote counts 1/(number of training samples in its class) instead of 1. In your example with k=3, if two of the nearest neighbors are class 0 and one is class 1, the predicted label would be class 1: the class 1 vote is 1/44 ≈ 0.023, while the class 0 vote is 2/360 ≈ 0.006, and 1/44 > 2/360. This is only one approach; the linked discussion covers several others. I hope this helps!
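The voting rule described above can be sketched in a few lines of plain Python. This is only a minimal illustration, not scikit-learn's API: the function name `inverse_class_size_vote` and its arguments are made up for this example, and it assumes you already have the labels of the k nearest neighbors of a query point.

```python
from collections import Counter


def inverse_class_size_vote(neighbor_labels, class_counts):
    """Pick the label with the largest inverse-class-size weighted vote.

    neighbor_labels: labels of the k nearest neighbors of one query point.
    class_counts: mapping from label to number of training samples in that class.
    (Both names are hypothetical helpers for this sketch.)
    """
    scores = Counter()
    for label in neighbor_labels:
        # Each neighbor contributes 1 / (size of its class) instead of 1.
        scores[label] += 1.0 / class_counts[label]
    return max(scores, key=scores.get)


# Class sizes from the question: 360 samples of class 0, 44 of class 1.
class_counts = {0: 360, 1: 44}

# Two class 0 neighbors and one class 1 neighbor: 1/44 > 2/360, so class 1 wins.
print(inverse_class_size_vote([0, 0, 1], class_counts))

# Three class 0 neighbors: class 1 gets no vote at all, so class 0 wins.
print(inverse_class_size_vote([0, 0, 0], class_counts))
```

To get the neighbor labels in practice, one option would be to fit scikit-learn's `NearestNeighbors` on the training data, call `kneighbors` on the query points to obtain the neighbor indices, and look up the training labels at those indices. Note that with weights this strong, a single class 1 neighbor among three always outvotes two class 0 neighbors, so it is worth checking on held-out data whether the rule now over-predicts class 1.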