I am working on a binary classification task. My evaluation data is imbalanced: approximately 20% of the instances belong to class1 and 80% to class2. Even though I have decent per-class accuracy (i.e., per-class recall), 0.602 on class1 and 0.792 on class2, the F1 score computed over class1 is only 0.46, since the false-positive count is large. Computed over class2, the F1 score is 0.84.
My question is: what is the best practice for evaluating a classification task on imbalanced data? Should I average these F1 scores, or choose one of them? What is the best evaluation metric for classification tasks on imbalanced data?
Btw, these are my TP, TN, FN, and FP counts:
TP: 115
TN: 716
FN: 76
FP: 188
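For reference, the per-class F1 scores above can be reproduced from these counts with scikit-learn. The label arrays below are a stand-in reconstruction from the counts (treating class1 as the positive label 1 and class2 as 0), not my actual data:

```python
from sklearn.metrics import f1_score
import numpy as np

# Reconstruct labels from the confusion-matrix counts
# (class1 = positive = 1, class2 = negative = 0).
y_true = np.array([1] * (115 + 76) + [0] * (716 + 188))  # 191 class1, 904 class2
y_pred = np.array([1] * 115 + [0] * 76 + [0] * 716 + [1] * 188)

print(f1_score(y_true, y_pred, pos_label=1))  # ~0.465, F1 over class1
print(f1_score(y_true, y_pred, pos_label=0))  # ~0.844, F1 over class2
```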
I am not sure if this is exactly what you are looking for, but since the data you want to compute a performance metric on is imbalanced, you could apply weighted measurements, such as a weighted F1 score. scikit-learn's f1_score offers a 'weighted' averaging option, which weights each class's score by its number of true instances. This way you get a single averaged F1 score.
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html
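Here is a minimal sketch using labels reconstructed from the counts in your question (class1 = 1, class2 = 0); with your own y_true and y_pred arrays, only the average argument changes:

```python
from sklearn.metrics import f1_score
import numpy as np

# Labels reconstructed from the question's counts:
# TP=115, FN=76, TN=716, FP=188 (class1 = 1, class2 = 0).
y_true = np.array([1] * (115 + 76) + [0] * (716 + 188))
y_pred = np.array([1] * 115 + [0] * 76 + [0] * 716 + [1] * 188)

# 'macro' averages the two per-class F1 scores equally;
# 'weighted' weights each class's F1 by its number of true instances.
print(f1_score(y_true, y_pred, average='macro'))     # ~0.65
print(f1_score(y_true, y_pred, average='weighted'))  # ~0.78
```

Note that the weighted average is dominated by the majority class, so on imbalanced data the macro average is often the stricter of the two.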
I hope that helps!