machine-learning training-data

Building a model that makes a decision when the size difference between two classes is too large


I'm currently building an ML model that uses a classifier to make a decision based on some conditions. However, the data I've collected is very abnormal: assume my data is classified as A and B, and the ratio of class A records to class B records is about 1:300.

Is there any way to handle this? I have tried many different approaches, but the results all overfit.


Solution

  • What you describe as too abnormal is called an imbalanced dataset in the machine-learning, data-mining, and statistics communities. It is the situation where the classes are not represented equally.

    This is not a rare case at all. In fact, in many classification problems the event of interest does not happen very often, which is exactly why it is of interest. The label for that event is therefore very infrequent compared to the other labels.

    There are plenty of approaches for dealing with an imbalanced dataset, and most of them try to make it balanced. Under-sampling and over-sampling are the typical approaches. A combination of the two can often give better results.
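
To make the combination concrete, here is a minimal sketch in plain Python. It is only an illustration, not a recommended library API: the function name `resample_binary` and its parameters `majority_keep` and `target_ratio` are made up for this example, and the toy data simply mimics the 1:300 ratio from the question. It randomly drops part of the majority class and then randomly duplicates minority records until the two classes are roughly the same size.

```python
import random
from collections import Counter


def resample_binary(rows, labels, minority, majority_keep=0.1,
                    target_ratio=1.0, seed=0):
    """Hypothetical helper: random under-sampling followed by random
    over-sampling for a two-class dataset."""
    rng = random.Random(seed)
    minority_rows = [r for r, y in zip(rows, labels) if y == minority]
    majority_pairs = [(r, y) for r, y in zip(rows, labels) if y != minority]
    if not minority_rows or not majority_pairs:
        raise ValueError("both classes must be present")

    # Under-sampling: keep only a random fraction of the majority class.
    n_keep = max(1, round(len(majority_pairs) * majority_keep))
    kept_majority = rng.sample(majority_pairs, n_keep)

    # Over-sampling: duplicate random minority rows (with replacement)
    # until minority / majority is about target_ratio.
    n_minority = max(len(minority_rows), round(n_keep * target_ratio))
    n_extra = n_minority - len(minority_rows)
    extra = [rng.choice(minority_rows) for _ in range(n_extra)]

    pairs = [(r, minority) for r in minority_rows + extra] + kept_majority
    rng.shuffle(pairs)
    return [r for r, _ in pairs], [y for _, y in pairs]


# Toy data with the 1:300 ratio from the question: 10 A records, 3000 B records.
X = [[float(i)] for i in range(3010)]
y = ["A"] * 10 + ["B"] * 3000

X_bal, y_bal = resample_binary(X, y, minority="A")
counts = Counter(y_bal)
print(counts["A"], counts["B"])
```

With the default settings the majority class is cut to a tenth of its size and the minority class is duplicated up to the same count. Resampling like this should be applied to the training split only, so that the model is still evaluated on data with the original class ratio.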

    First Google suggestion gives me this: