Tags: python, tensorflow, keras, deep-learning, loss-function

Difference between batch-average and global Fscore


I am facing a False Positive Reduction problem, and the ratio of positive to negative samples is approximately 1.7:1. I learned from the answer that precision, recall, F-score, or even weighting true positives, false positives, true negatives and false negatives differently depending on their cost, can be used to evaluate different models for such a classification task.

Since Precision, Recall, and F-score were removed from Keras, I found some methods to track those metrics during training, such as the GitHub repo keras-metrics.

Besides, I also found other solutions that define precision like this:

from keras import backend as K

def precision(y_true, y_pred):
    """Precision metric.
    Only computes a batch-wise average of precision.
    Computes the precision, a metric for multi-label classification of
    how many selected items are relevant.
    """
    # Count predictions that are both predicted positive and actually positive.
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    # Count everything predicted positive in this batch.
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    precision = true_positives / (predicted_positives + K.epsilon())
    return precision
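
Such a function can then be passed to model.compile so Keras reports it during training (a rough sketch of how I wire it in; the model, data and hyperparameters are just placeholders):

model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy", precision])

# Keras then prints the batch-wise precision in the progress bar and
# stores it in the training history under the function's name.
history = model.fit(X_train, y_train,
                    batch_size=64,
                    epochs=10,
                    validation_data=(X_val, y_val))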

However, all of those methods track the metrics during training, and all of them are described as batch-wise averages rather than global values. I wonder how necessary it is to keep tracking those metrics during training. Or should I just focus on the loss and accuracy during training, and then evaluate all models with validation functions such as those from scikit-learn, which compute the metrics globally?
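
For example, this is the kind of global evaluation I have in mind, computed once over the whole validation set with scikit-learn (a sketch; model, X_val and y_val are placeholders for my own model and data):

from sklearn.metrics import precision_score, recall_score, f1_score

# Predict on the whole validation set at once, then threshold to class labels.
y_prob = model.predict(X_val)
y_pred = (y_prob > 0.5).astype("int32")

# Global metrics, computed over all validation samples rather than per batch.
print("precision:", precision_score(y_val, y_pred))
print("recall:   ", recall_score(y_val, y_pred))
print("f1 score: ", f1_score(y_val, y_pred))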


Solution

  • In Keras, all training metrics are measured batch-wise. To obtain a global metric, Keras will average these batch-metrics.

    Something like sum(batch_metrics) / batches.

    Since most metrics are mean values over the "number of samples", that kind of averaging will not change the global value much.

    If samples % batch_size == 0, then we can say that:

    sum(all_samples_metrics) / samples == sum(all_batch_metrics) / batches
    

    But the specific metrics you are talking about are not divided by the "number of samples", but by the number of samples "that satisfy a condition". Thus, the divisor in each batch is different. Mathematically, averaging the batch-metrics to obtain a global result will not reflect the true global result (the numeric sketch at the end of this answer illustrates this).

    So, can we say that they're not good for training?

    Well, no. They may be good for training. Sometimes "accuracy" is a terrible metric for a specific problem.

    The key to using these metrics batch-wise is to have a batch size that is big enough to avoid too much variation in the divisors.
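
    As a concrete illustration of the divisor issue (the numbers are made up, purely for demonstration), compare a metric that is divided by the number of samples (accuracy) with one that is not (precision), over two batches of equal size:

    # Two batches of 10 samples each (so samples % batch_size == 0).
    # Batch 1: 8 correct predictions, 8 predicted positives, 6 true positives
    # Batch 2: 9 correct predictions, 2 predicted positives, 2 true positives
    correct = [8, 9]
    predicted_pos = [8, 2]
    true_pos = [6, 2]
    batch_size = 10

    # Accuracy is divided by the number of samples, which is identical in every
    # batch, so the average of the batch accuracies equals the global accuracy:
    batch_acc_avg = sum(c / batch_size for c in correct) / 2   # (0.8 + 0.9) / 2 = 0.85
    global_acc = sum(correct) / (2 * batch_size)               # 17 / 20         = 0.85

    # Precision is divided by the number of predicted positives, which differs
    # per batch, so the batch average and the global value disagree:
    batch_prec_avg = sum(t / p for t, p in zip(true_pos, predicted_pos)) / 2   # (0.75 + 1.0) / 2 = 0.875
    global_prec = sum(true_pos) / sum(predicted_pos)                           # 8 / 10           = 0.8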