Tags: python, scikit-learn, artificial-intelligence, classification, prediction

Curse of the scikit-learn Classifiers


Suppose we have 1,000 beads: 900 red and 100 blue. When I run the problem through scikit-learn classifier ensembles,

score = clf.score(X_test, y_test)

they come up with scores of around 0.9. However, when I look at the predictions, I see that the model has predicted all of them to be red, and that is how it reaches 90% accuracy! What am I doing wrong? Better yet, what does it mean when this happens? Is there a better way to measure accuracy?


Solution

  • This can happen when you have an imbalanced dataset and you choose accuracy as your metric. By always predicting red, the model does well in terms of raw accuracy, but as you noticed, it is useless! To overcome this issue, you have several alternatives:
    1. Use another metric, such as AUC (area under the ROC curve), balanced accuracy, or the F1 score.
    2. Use different weights for the classes, putting more weight on the minority class.
    3. Use simple over-sampling or under-sampling, or more sophisticated methods such as SMOTE or ADASYN.

    You can also take a look at this article.
    The problem you face is quite common in real-world applications.
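To see the accuracy paradox concretely, here is a minimal sketch of the beads scenario (synthetic features; the data and names are illustrative). A classifier that always predicts the majority class scores 0.9 on accuracy, yet its recall on the minority class and its AUC reveal it has learned nothing:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))          # uninformative features
y = np.array([0] * 900 + [1] * 100)     # 0 = red (majority), 1 = blue

# Always predicts the majority class, like the model in the question.
clf = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = clf.predict(X)

print(accuracy_score(y, pred))   # 0.9 -- looks deceptively good
print(recall_score(y, pred))     # 0.0 -- never finds a single blue bead
print(roc_auc_score(y, clf.predict_proba(X)[:, 1]))  # 0.5 -- chance level
```

This is why alternative 1 matters: accuracy rewards the degenerate model, while recall and AUC expose it immediately.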
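Alternative 2 (class weights) can be sketched as follows. Many scikit-learn estimators accept `class_weight="balanced"`, which re-weights samples inversely to class frequency. The shifted synthetic data below is purely illustrative; the point is only to compare an unweighted and a weighted fit on the same imbalanced data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, recall_score

rng = np.random.default_rng(1)
# Minority (blue) class shifted so the problem is actually learnable.
X = np.vstack([rng.normal(0.0, 1.0, size=(900, 2)),
               rng.normal(2.0, 1.0, size=(100, 2))])
y = np.array([0] * 900 + [1] * 100)

plain = LogisticRegression().fit(X, y)
weighted = LogisticRegression(class_weight="balanced").fit(X, y)

# The weighted model recovers far more of the minority class.
print(recall_score(y, plain.predict(X)))
print(recall_score(y, weighted.predict(X)))
print(balanced_accuracy_score(y, weighted.predict(X)))
```

Pairing `class_weight="balanced"` with a metric such as `balanced_accuracy_score` (rather than plain accuracy) keeps both training and evaluation honest about the minority class.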
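Alternative 3 can be as simple as random over-sampling of the minority class. SMOTE and ADASYN live in the separate imbalanced-learn package; the sketch below uses only scikit-learn's `resample` utility, and the array names are illustrative:

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 2))
y = np.array([0] * 900 + [1] * 100)     # 0 = red, 1 = blue

# Draw minority samples with replacement until both classes match in size.
X_min, y_min = X[y == 1], y[y == 1]
X_up, y_up = resample(X_min, y_min, replace=True,
                      n_samples=900, random_state=0)

X_bal = np.vstack([X[y == 0], X_up])
y_bal = np.concatenate([y[y == 0], y_up])
print(np.bincount(y_bal))  # both classes now have 900 samples
```

Note that resampling should be applied to the training split only, never to the test set, or the evaluation will be optimistically biased.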