
How to deal with uneven number of samples in classification?


Suppose we have 2 labels: 0 and 1.

There are 1000 samples with label 0 but only 100 samples with label 1.

In this situation, a trained classifier will be biased toward label 0.

What can be done in this scenario?

Can we manually generate additional samples for label 1?

If we can do so, how to validate that the generated samples possess the same properties/characteristics as the original data?


Solution

  • See this article. It's about a method called SMOTE, which stands for Synthetic Minority Over-sampling Technique. Basically, if you have data distributed like this (small number of red dots, larger number of green dots): [image]

    You synthesize new samples around the existing ones: [image]

    This is one of the commonly used methods, and it is described in greater detail in the article linked above. There are also simpler methods, such as removing some data points from the majority class (undersampling) or duplicating some of those in the minority class (oversampling).

    The images have been taken from the article.
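
To make the idea of synthesizing samples "around the existing ones" concrete, here is a minimal sketch in plain Python. It is not the reference implementation from the article: the function name `smote_sketch`, its parameters (`n_new`, `k`, `seed`) and the toy 2-D data are all assumptions made for illustration. Each new point is placed at a random position on the line segment between a minority sample and one of its `k` nearest minority-class neighbours. For real work, the imbalanced-learn library provides a `SMOTE` class that is likely a better choice.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points interpolated between minority samples.

    Hypothetical helper for illustration only. `minority` is a list of
    equal-length numeric tuples belonging to the minority class.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of the base point (itself excluded)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(p, base),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy data (assumed): 100 minority samples, as in the question.
data_rng = random.Random(42)
minority = [(data_rng.gauss(2.0, 0.5), data_rng.gauss(2.0, 0.5)) for _ in range(100)]

# Create 900 synthetic samples so label 1 matches the 1000 samples of label 0.
new_points = smote_sketch(minority, n_new=900)
balanced_minority = minority + new_points
print(len(balanced_minority))  # 1000
```

Because every synthetic point lies between two real minority samples, the new data stays inside the region the minority class already occupies. Apply this kind of resampling to the training split only, so that the test set still contains nothing but real samples.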