Tags: python, scikit-learn, tensorflow2.0, multiclass-classification, auc

Interpreting AUC, accuracy and f1-score on the unbalanced dataset


I am trying to understand why AUC is a better metric than classification accuracy when the dataset is unbalanced.
Suppose a dataset contains 1000 examples from 3 classes, as follows:

a = [[1.0, 0, 0]]*950 + [[0, 1.0, 0]]*30 + [[0, 0, 1.0]]*20

Clearly, this data is unbalanced.
A naive strategy is to predict that every point belongs to the first class.
Suppose we have a classifier with the following predictions:

b = [[0.7, 0.1, 0.2]]*1000

With the true labels in the list a and the predictions in the list b, the classification accuracy is 0.95.
One might therefore believe that the model does well on the classification task, but it does not, because it assigns every point to the same class.
For this reason, AUC is often suggested for evaluating models on unbalanced datasets.
If we compute the AUC using the TF Keras AUC metric, we obtain ~0.96.
If we compute the F1-score using the sklearn f1_score metric with b=[[1,0,0]]*1000, we obtain 0.95.

Now I am a little confused, because all three metrics (accuracy, AUC and F1-score) show high values, which would suggest that the model is really good at the prediction task (which is not the case here).

What am I missing here, and how should these values be interpreted?
Thanks.


Solution

  • You are very likely using the average='micro' parameter to calculate the F1-score. According to the docs, specifying 'micro' as the averaging strategy will:

    Calculate metrics globally by counting the total true positives, false negatives and false positives.

    In classification tasks where every test case is guaranteed to be assigned to exactly one class, computing a micro F1-score is equivalent to computing the accuracy score. Just check it out:

    from sklearn.metrics import accuracy_score, f1_score
    
    y_true = [[1, 0, 0]]*950 + [[0, 1, 0]]*30 + [[0, 0, 1]]*20
    y_pred = [[1, 0, 0]]*1000
    
    print(accuracy_score(y_true, y_pred)) # 0.95
    
    print(f1_score(y_true, y_pred, average='micro')) # 0.9500000000000001
    

    You basically computed the same metric twice. By specifying average='macro' instead, the F1-score will be computed for each label independently first, and then averaged:

    print(f1_score(y_true, y_pred, average='macro')) # 0.3247863247863248
    

    As you can see, the overall F1-score depends on the averaging strategy, and a macro F1-score of less than 0.33 is a clear indicator of a model's deficiency in the prediction task.
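To see where the two numbers come from, the counts can be tallied by hand. This is only a minimal sketch with NumPy rather than a description of how scikit-learn computes the score internally; the variable names (tp, fp, fn and so on) are chosen here for illustration. It uses F1 = 2·TP / (2·TP + FP + FN), pooled over all classes for micro and per class for macro. In a single-label problem every wrong prediction counts once as a false positive (for the predicted class) and once as a false negative (for the true class), which is why the pooled score reduces to accuracy.

```python
import numpy as np

# Same data as above, written as integer class labels.
y_true = np.array([0] * 950 + [1] * 30 + [2] * 20)
y_pred = np.zeros(1000, dtype=int)

classes = np.array([0, 1, 2])

# Per-class true positives, false positives and false negatives.
tp = np.array([np.sum((y_pred == c) & (y_true == c)) for c in classes])
fp = np.array([np.sum((y_pred == c) & (y_true != c)) for c in classes])
fn = np.array([np.sum((y_pred != c) & (y_true == c)) for c in classes])

# Micro: pool the counts over all classes, then compute a single F1.
micro_f1 = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())

# Macro: compute one F1 per class, then take the unweighted mean.
per_class_f1 = 2 * tp / (2 * tp + fp + fn)
macro_f1 = per_class_f1.mean()

accuracy = np.mean(y_true == y_pred)

print(micro_f1, accuracy)      # both equal 0.95 up to rounding
print(per_class_f1, macro_f1)  # only class 0 has a non-zero F1
```

The majority class reaches an F1 of about 0.97, while the two minority classes score 0, so the unweighted mean drops to roughly 0.32.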


    EDIT:

    Since the OP asked when to choose which strategy, and I think it might be useful for others as well, I will try to elaborate a bit on this issue.

    scikit-learn actually implements four different strategies for metrics that support averaging in multiclass and multilabel classification tasks. Conveniently, classification_report returns all of those that apply to a given classification task for precision, recall and F1-score:

    from sklearn.metrics import classification_report
    
    # The same example with flat integer labels instead of nested lists, so that sklearn does not treat it as a multilabel problem.
    y_true = [0] * 950 + [1] * 30 + [2] * 20
    y_pred = [0] * 1000
    
    print(classification_report(y_true, y_pred, zero_division=0))
    
    ######################### output ####################
    
                  precision    recall  f1-score   support
    
               0       0.95      1.00      0.97       950
               1       0.00      0.00      0.00        30
               2       0.00      0.00      0.00        20
    
        accuracy                           0.95      1000
       macro avg       0.32      0.33      0.32      1000
    weighted avg       0.90      0.95      0.93      1000
    

    All of them provide a different perspective, depending on how much emphasis one puts on the class distribution.

    1. micro average is a global strategy that ignores the distinction between classes. This might be useful or justified if someone is really just interested in the overall disagreement in terms of true positives, false negatives and false positives, and is not concerned about differences between the classes. As hinted before, if the underlying problem is not a multilabel classification task, this equals the accuracy score. (This is also why the classification_report function returned accuracy instead of micro avg.)

    2. macro average calculates each metric for each label separately and returns their unweighted mean. This is suitable if every class is equally important and the result should not be skewed in favour of any of the classes in the dataset.

    3. weighted average also calculates each metric for each label separately first, but the average is weighted by each class's support (its number of true instances). This is desirable if the importance of the classes is proportional to their support, i.e. a class that is underrepresented is considered less important.

    4. samples average is only meaningful for multilabel classification and therefore not returned by classification_report in this example and also not discussed here ;)
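The three strategies that apply to this example can also be requested one at a time through the average parameter of f1_score. The snippet below is a small sketch on the same toy data; zero_division=0 is passed only to silence the warnings for the two classes that are never predicted, and the weighted value is re-derived by hand from the per-class scores as a sanity check.

```python
from sklearn.metrics import f1_score

y_true = [0] * 950 + [1] * 30 + [2] * 20
y_pred = [0] * 1000

# One F1-score per averaging strategy.
scores = {
    avg: f1_score(y_true, y_pred, average=avg, zero_division=0)
    for avg in ("micro", "macro", "weighted")
}
for avg, value in scores.items():
    print(f"{avg:>8}: {value:.4f}")

# The weighted average is the support-weighted mean of the per-class scores.
per_class = f1_score(y_true, y_pred, average=None, zero_division=0)
support = [950, 30, 20]
weighted_by_hand = sum(f * s for f, s in zip(per_class, support)) / sum(support)
print(f"by hand : {weighted_by_hand:.4f}")
```

On this data the micro and weighted scores stay above 0.9 because class 0 dominates both, while only the macro score exposes the two ignored classes.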

    So the choice of averaging strategy, and of which resulting number to trust, really depends on the importance of the classes. Do I even care about class differences (if not --> micro average)? If so, are all classes equally important (if yes --> macro average), or is a class with higher support more important (--> weighted average)?
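The same micro-versus-macro distinction may also explain the AUC of ~0.96 reported in the question. This is an assumption rather than something verified against the original setup: with its default multi_label=False, the Keras AUC metric flattens all class columns into a single binary problem, which behaves like a micro average. The sketch below uses scikit-learn's roc_auc_score instead of Keras, to stay with the library used in the rest of this answer, and compares the pooled (micro) AUC with the per-class (macro) AUC for the constant predictions from the question.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# One-hot labels and the constant predictions from the question.
y_true = np.array([[1, 0, 0]] * 950 + [[0, 1, 0]] * 30 + [[0, 0, 1]] * 20)
y_score = np.array([[0.7, 0.1, 0.2]] * 1000)

# Micro: flatten all (sample, class) pairs into one binary problem.
micro_auc = roc_auc_score(y_true, y_score, average="micro")

# Macro: one AUC per class column, then the unweighted mean.
macro_auc = roc_auc_score(y_true, y_score, average="macro")

print(micro_auc)  # about 0.96
print(macro_auc)  # 0.5
```

Within each class column the score is constant, so every per-class AUC is 0.5 (chance level) and so is their mean. The pooled value is high only because the dominant class happens to receive the largest constant score.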