I have a dataset with 360 samples for class 0 and only 44 samples for class 1. When I fit a KNN model to the data using k=3, the model misclassifies lots of samples as class 0. What is the best way to deal with such unevenly sampled data? I could set k=1, but from what I have read, that leads to noise having a strong effect.
Check out this discussion on CrossValidated, especially the third answer. One approach mentioned there, for example, is to weight neighbors "by the inverse of their class size". In your example with k=3, this would mean that when two of the three nearest neighbors are class 0 and one is class 1, the predicted label would be class 1, since 1/44 > 2/360. This is only one approach, and you can find more in the discussion linked above. I hope this helps!
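To make the arithmetic concrete, here is a minimal sketch of the inverse-class-size vote in plain Python. The function name `knn_predict`, the `inverse_class_size` flag, and the 1-D toy data are all made up for illustration; they are not taken from the linked discussion or from any library. The toy set only reproduces your class counts (360 vs. 44) and a query whose three nearest neighbors are two class 0 points and one class 1 point.

```python
import math
from collections import Counter


def knn_predict(X_train, y_train, x, k=3, inverse_class_size=True):
    """Predict a label for x from its k nearest training points.

    With inverse_class_size=True each neighbor's vote counts
    1 / (number of training samples in its class) instead of 1.
    """
    class_sizes = Counter(y_train)
    # Training indices sorted by Euclidean distance to the query point.
    order = sorted(range(len(X_train)), key=lambda i: math.dist(X_train[i], x))
    votes = Counter()
    for i in order[:k]:
        label = y_train[i]
        votes[label] += 1.0 / class_sizes[label] if inverse_class_size else 1.0
    # Label with the largest accumulated vote.
    return max(votes, key=votes.get)


# Hypothetical toy data: 360 class 0 points at 0, 1, ..., 359
# and 44 class 1 points at -1, -2, ..., -44.
X_train = [[float(i)] for i in range(360)] + [[-float(i)] for i in range(1, 45)]
y_train = [0] * 360 + [1] * 44

# The three nearest neighbors of 0.4 are 0.0 and 1.0 (class 0) and -1.0 (class 1).
query = [0.4]
plain = knn_predict(X_train, y_train, query, k=3, inverse_class_size=False)
weighted = knn_predict(X_train, y_train, query, k=3)
print(plain, weighted)
```

The plain majority vote picks class 0 (two votes against one), while the weighted vote picks class 1, because 1/44 is larger than 2/360. Keep in mind that this trades errors in one direction for errors in the other, so it is worth checking per-class metrics on held-out data rather than overall accuracy alone.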