I am facing a false-positive reduction problem, and the ratio of positive to negative samples is approx. 1.7:1. I learned from the answer that precision, recall, F-score, or even weighting true positives, false positives, true negatives, and false negatives differently depending on cost, can be used to evaluate different models for this kind of classification task.
Since precision, recall, and F-score were removed from Keras, I found some ways to keep track of those metrics during training, such as the GitHub repo keras-metrics.
Besides, I also found other solutions that define precision like this:
from keras import backend as K

def precision(y_true, y_pred):
    """Precision metric.

    Only computes a batch-wise average of precision.
    Computes the precision, a metric for multi-label classification of
    how many selected items are relevant.
    """
    # True positives: predictions that are positive and match the labels.
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    # All positive predictions in the batch.
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    # K.epsilon() guards against division by zero.
    precision = true_positives / (predicted_positives + K.epsilon())
    return precision
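For completeness, here is a minimal sketch of how such a function plugs into compile()/fit(); the toy model and random data below are placeholders, not my actual setup:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Toy binary classifier on random data, purely to show where the
# custom precision metric is attached.
x_train = np.random.rand(256, 10)
y_train = np.random.randint(0, 2, size=(256, 1))

model = Sequential([Dense(16, activation='relu', input_shape=(10,)),
                    Dense(1, activation='sigmoid')])

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy', precision])  # the function defined above

model.fit(x_train, y_train, batch_size=64, epochs=2)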
However, those methods only track the metrics during training, and all of them state that they compute a batch-wise average rather than a global value. I wonder how necessary it is to keep track of those metrics during training. Or should I just focus on the loss and accuracy during training, and then evaluate all models with functions from, e.g., scikit-learn, which compute those metrics globally?
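For example, something like this sketch, where the random labels and scores are placeholders standing in for a trained model's predictions on a held-out validation set:

import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Placeholder predictions; in practice y_prob would come from
# model.predict(x_val) on the full validation set.
y_val = np.random.randint(0, 2, size=1000)
y_prob = np.random.rand(1000)
y_pred = (y_prob > 0.5).astype(int)  # threshold sigmoid outputs

# Computed once over all samples, so the values are global, not batch-wise.
print('precision:', precision_score(y_val, y_pred))
print('recall:   ', recall_score(y_val, y_pred))
print('f1:       ', f1_score(y_val, y_pred))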
In Keras, all training metrics are measured batch-wise. To obtain a global metric, Keras averages these batch metrics: something like sum(batch_metrics) / batches.
Since most metrics are mean values taken over the "number of samples", that kind of averaging does not change the global value much.
If samples % batch_size == 0, then we can say that:

sum(all_samples_metrics) / samples == sum(all_batch_metrics) / batches
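For instance, with a sample-divided metric like accuracy and equal-sized batches (made-up per-batch counts in this sketch), the two sides match exactly:

import numpy as np

correct = np.array([50, 40, 45, 55])  # correct predictions per batch
batch_size = 64                       # samples % batch_size == 0

batch_acc = correct / batch_size
print(batch_acc.mean())                  # average of the batch metrics
print(correct.sum() / (batch_size * 4))  # global accuracy: identical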
But the specific metrics you are talking about are not divided by the "number of samples", but by the number of samples "that satisfy a condition" (for precision, the number of predicted positives). Thus, the divisor differs from batch to batch, and mathematically, averaging the batch metrics to obtain a global result will not reflect the true global result.
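A small numeric sketch (made-up counts) shows how far off the batch average can be for precision:

import numpy as np

# Two batches with very different numbers of predicted positives:
# batch 1: 8 predicted positives, 6 correct -> precision 0.75
# batch 2: 2 predicted positives, 0 correct -> precision 0.00
tp = np.array([6, 0])
pred_pos = np.array([8, 2])

batch_precisions = tp / pred_pos
print(batch_precisions.mean())    # 0.375, the averaged batch metric
print(tp.sum() / pred_pos.sum())  # 0.6, the true global precision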
So, can we say that they're not good for training?
Well, no. They may be good for training. Sometimes "accuracy" is a terrible metric for a specific problem.
The key to using these metrics batch-wise is to have a batch size big enough to avoid too much variation in the divisors.
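As a rough illustration, this simulation sketch (random labels and predictions, purely hypothetical) shows the per-batch precision fluctuating less as the batch size grows:

import numpy as np

rng = np.random.RandomState(0)
y_true = rng.randint(0, 2, size=10000)
y_pred = rng.randint(0, 2, size=10000)

for batch_size in (16, 1024):
    precisions = []
    for start in range(0, len(y_true), batch_size):
        t = y_true[start:start + batch_size]
        p = y_pred[start:start + batch_size]
        pred_pos = p.sum()
        if pred_pos:  # skip batches with no predicted positives
            precisions.append((t & p).sum() / pred_pos)
    # Larger batches -> smaller spread in the per-batch divisor and metric.
    print(batch_size, np.std(precisions))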