machine-learning, image-classification

Is it ideal to make each class exactly equal in size for better machine learning?


Hi, I'm building an image classifier with a CNN. My data has 7 classes with 235, 211, 251, ... samples each, 1573 in total. I've heard it's important to balance the data, but I don't know how equal the classes need to be. Should I make every class almost exactly the same size, with differences within 1? Or is it still okay that, in my case, the biggest difference is 46 between class1 and class2? And which way is better: cutting off samples, or adding data using ImageDataGenerator?

Could someone give me some advice?


Solution

  • What you're referring to here is what's called the "data imbalance" problem in ML lingo. It refers to the situation where the number of observations in one class is much higher than in another; think of a case where the ratio between 2 classes is something like 1:100. That said, there are no strict rules on how imbalanced the ratio has to be before it becomes a problem.

    In your case, the biggest gap between 2 classes is 46 samples, which doesn't seem like a big difference in absolute terms. That said, your total sample size isn't very big either, so in relative terms it matters a bit more. A quick way to quantify this is the imbalance ratio between the largest and smallest class, computed in the first sketch at the end of this answer.

    You can think of the problem with data imbalance as a zero-sum game in which the 2 classes keep pushing against the decision boundary. When the 2 classes have roughly the same number of samples (it doesn't have to be strictly equal), the boundary settles into something like a "Nash equilibrium": both classes push with the same force, so the boundary sits in the middle and successfully discriminates between them. But when the forces become very unequal, the majority class pushes the boundary so far that the minority class can't push it back, and the decision boundary fails to discriminate the 2 classes.

    (Note: the scenario above is highly simplified and imaginative, without much theory to back it up; it's only meant to build a perspective/intuition about the problem. The second sketch at the end of this answer shows this boundary shift on toy data.)

    So, I would advise you to train the model on your current data as it is and see how it performs. From a theoretical point of view, it shouldn't be that bad, since your imbalance isn't severe. Nonetheless, ML is an empirical science, so you can also see what happens when a data-imbalance technique is applied. There are two families: oversampling, where you create data for the minority class, and undersampling, where you reduce the samples of the majority class. SMOTE is a popular oversampling technique and RUSBoost is a popular undersampling technique; both families are sketched below.
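
First, a minimal sketch for measuring the imbalance. It assumes the common one-sub-directory-per-class layout under a hypothetical `data/train` root; adjust the path to your setup:

```python
from pathlib import Path

data_dir = Path("data/train")  # hypothetical root; adjust to your layout
counts = {d.name: sum(1 for f in d.iterdir() if f.is_file())
          for d in data_dir.iterdir() if d.is_dir()}
print(counts)  # e.g. {'class1': 235, 'class2': 211, 'class3': 251, ...}

# Imbalance ratio: largest class over smallest. A 1:100 problem has ratio 100;
# with counts like yours it stays well under 2.
print(max(counts.values()) / min(counts.values()))
```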
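
Second, a toy sketch of the boundary-pushing intuition, on assumed 1-D Gaussian data rather than images: a logistic regression is fit once on balanced classes and once on heavily imbalanced ones, and the point where the decision boundary lands is printed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def boundary(n_neg, n_pos):
    """Fit on two 1-D Gaussian classes and return where p = 0.5 lands."""
    X = np.concatenate([rng.normal(-2.0, 1.0, n_neg),
                        rng.normal(+2.0, 1.0, n_pos)]).reshape(-1, 1)
    y = np.concatenate([np.zeros(n_neg), np.ones(n_pos)])
    clf = LogisticRegression().fit(X, y)
    return -clf.intercept_[0] / clf.coef_[0, 0]  # w*x + b = 0  =>  x = -b/w

print(boundary(500, 500))  # balanced: boundary near 0, between the two classes
print(boundary(5, 500))    # imbalanced: pushed toward the minority class at -2
```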
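
Finally, a hedged sketch of the two resampling families using the imbalanced-learn package (`pip install imbalanced-learn`); `X` and `y` here are random stand-ins for your flattened image features and labels:

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Stand-ins for your data: 1573 flattened images across 7 classes.
X = np.random.rand(1573, 64)
y = np.random.randint(0, 7, size=1573)

# Oversampling: SMOTE synthesizes minority-class points by interpolating
# between nearest neighbours until every class matches the majority count.
X_over, y_over = SMOTE(random_state=0).fit_resample(X, y)

# Undersampling: randomly drop majority-class samples to the minority count.
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)

print(np.bincount(y_over))   # all classes at the original maximum
print(np.bincount(y_under))  # all classes at the original minimum
```

For images specifically, augmented copies from Keras's ImageDataGenerator are a common way to oversample the smaller classes, and RUSBoost itself ships as imblearn.ensemble.RUSBoostClassifier.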