algorithm, machine-learning, data-mining, classification, knn

k nearest neighbor classifier training sample size for each class


Could someone please tell me whether the training sample sizes for each class need to be equal?

Can I use a scenario like this?

          class1   class2  class3
samples    400      500     300

or should all the classes have equal sample sizes?


Solution

  • The KNN results basically depend on 3 things (besides the value of k, the number of neighbors):

    • Density of your training data: you should have roughly the same number of samples for each class. It doesn't need to be exact, but I'd say no more than about 10% disparity; otherwise the boundaries will get very fuzzy.
    • Size of your whole training set: you need sufficiently many examples in your training set for your model to generalize to unknown samples.
    • Noise: KNN is very sensitive to noise by nature, so you want to avoid noise in your training set as much as possible (see the sketch right after this list).
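    To make the noise point concrete, here is a minimal sketch (Python with scikit-learn; the dataset and all sizes/parameters are invented for illustration) showing that flipping a fraction of the training labels hurts accuracy, and that a larger k smooths some of the damage out:

        import numpy as np
        from sklearn.datasets import make_blobs
        from sklearn.model_selection import train_test_split
        from sklearn.neighbors import KNeighborsClassifier

        # Synthetic 3-class problem; all sizes/parameters are illustrative.
        X, y = make_blobs(n_samples=1200, centers=3, cluster_std=2.0, random_state=0)
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

        # Corrupt 15% of the training labels to simulate noise.
        rng = np.random.default_rng(0)
        y_noisy = y_train.copy()
        flip = rng.random(len(y_noisy)) < 0.15
        y_noisy[flip] = rng.integers(0, 3, flip.sum())

        for k in (1, 15):
            clean = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
            noisy = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_noisy)
            print(f"k={k:2d}  clean={clean.score(X_test, y_test):.3f}"
                  f"  noisy={noisy.score(X_test, y_test):.3f}")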

    Consider the following example where you're trying to learn a donut-like shape in a 2D space.

    By having a different density in your training data (say, more training samples inside the donut than outside), your decision boundary will be biased, as shown below:

    [figure "donut-bad": decision boundary pulled toward the denser class]

    On the other hand, if your classes are relatively balanced, you'll get a much finer decision boundary that will be close to the actual shape of the donut:

    [figure: decision boundary with balanced classes, close to the donut shape]
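    If you want to reproduce something like the donut experiment yourself, here is a sketch under my own assumptions (scikit-learn's make_circles stands in for the donut, with the inner circle as one class and the outer ring as the other; all sample counts are invented):

        import numpy as np
        from sklearn.datasets import make_circles
        from sklearn.neighbors import KNeighborsClassifier

        # Two concentric rings: label 1 = inner circle, label 0 = outer ring.
        X, y = make_circles(n_samples=4000, noise=0.08, factor=0.5, random_state=0)
        inner, outer = np.where(y == 1)[0], np.where(y == 0)[0]
        rng = np.random.default_rng(0)

        def subset(n_inner, n_outer):
            """Draw a training set with the given per-class sizes."""
            idx = np.concatenate([rng.choice(inner, n_inner, replace=False),
                                  rng.choice(outer, n_outer, replace=False)])
            return X[idx], y[idx]

        X_test, y_test = make_circles(n_samples=1000, noise=0.08, factor=0.5,
                                      random_state=1)

        for name, sizes in {"imbalanced": (900, 100), "balanced": (500, 500)}.items():
            knn = KNeighborsClassifier(n_neighbors=5).fit(*subset(*sizes))
            # Accuracy on the minority (outer) class exposes the biased boundary.
            outer_acc = knn.score(X_test[y_test == 0], y_test[y_test == 0])
            print(f"{name:10s} outer-ring accuracy: {outer_acc:.3f}")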

    So basically, I would advise trying to balance your dataset (for example by undersampling the larger classes or oversampling the smaller ones; a minimal sketch follows), and also to take into consideration the two other items I mentioned above, and you should be fine.
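    The simplest balancing scheme is random undersampling: shrink every class to the size of the smallest one. A minimal sketch (NumPy only; X and y are hypothetical feature/label arrays):

        import numpy as np

        def undersample(X, y, seed=0):
            """Randomly keep min-class-count samples from each class."""
            rng = np.random.default_rng(seed)
            classes, counts = np.unique(y, return_counts=True)
            n = counts.min()
            keep = np.concatenate([rng.choice(np.where(y == c)[0], n, replace=False)
                                   for c in classes])
            return X[keep], y[keep]

        # e.g. the asker's 400/500/300 split becomes 300/300/300.

    If you'd rather not roll your own, the imbalanced-learn package ships ready-made samplers such as RandomUnderSampler.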

    In case you have to deal with imbalanced training data, you could also consider using the weighted KNN (WKNN) algorithm (a refinement of KNN) to assign stronger weights to the classes that have fewer elements; a rough sketch of one such weighting closes this answer.
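    There are several WKNN variants; one simple idea along the lines above is to scale each neighbor's vote by the inverse frequency of its class, so minority-class neighbors count for more. A rough sketch (my own naming, not a library implementation):

        import numpy as np
        from sklearn.neighbors import NearestNeighbors

        def wknn_predict(X_train, y_train, X_query, k=5):
            """KNN with votes weighted by inverse class frequency."""
            classes, counts = np.unique(y_train, return_counts=True)
            class_w = dict(zip(classes, 1.0 / counts))
            nn = NearestNeighbors(n_neighbors=k).fit(X_train)
            _, idx = nn.kneighbors(X_query)
            preds = []
            for neighbors in idx:
                votes = {c: 0.0 for c in classes}
                for j in neighbors:
                    votes[y_train[j]] += class_w[y_train[j]]
                preds.append(max(votes, key=votes.get))
            return np.array(preds)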