python, python-2.7, tensorflow, keras, keras-2

Keras metrics with TF backend vs tensorflow metrics


When Keras 2.x removed certain metrics, the changelog said it did so because they were "batch-based" and therefore not always accurate. What is meant by this? Do the corresponding metrics included in TensorFlow suffer from the same drawback? For example, the precision and recall metrics.


Solution

  • Let's take precision as an example. The stateless version that was removed was implemented like this:

    from keras import backend as K  # Keras backend functions used below

    def precision(y_true, y_pred):
        """Precision metric.

        Only computes a batch-wise average of precision.

        Computes the precision, a metric for multi-label classification of
        how many selected items are relevant.
        """
        true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
        predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
        precision = true_positives / (predicted_positives + K.epsilon())
        return precision

    This is fine if y_true contains all of the labels in the dataset and y_pred contains the model's predictions for all of those labels.
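
    For instance, evaluated once over the full set of labels and predictions, the function returns the dataset-level precision exactly. A toy example (hypothetical data, using the precision function defined above):

    import numpy as np
    from keras import backend as K

    y_true = K.constant(np.array([1., 1., 0., 0., 1.]))
    y_pred = K.constant(np.array([1., 0., 1., 0., 1.]))

    # 2 true positives out of 3 predicted positives -> precision ~0.667
    print(K.eval(precision(y_true, y_pred)))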

    The issue is that people often divide their datasets into batches, for example evaluating on 10,000 images by running 10 evaluations of 1,000 images each, which can be necessary to fit within memory constraints. In that case you get 10 different precision scores with no way to combine them into the dataset-level precision.
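
    To make this concrete, here is a small numeric sketch (the counts are made up for illustration): averaging per-batch precision scores is not the same as computing precision over the pooled counts, because each batch can contribute a different number of predicted positives.

    import numpy as np

    # Hypothetical counts from two evaluation batches.
    # Batch 1:  9 true positives out of  10 predicted positives -> precision 0.90
    # Batch 2: 10 true positives out of 100 predicted positives -> precision 0.10
    true_positives = np.array([9., 10.])
    predicted_positives = np.array([10., 100.])

    # Averaging the per-batch scores:
    print(np.mean(true_positives / predicted_positives))     # 0.5

    # Precision over the whole dataset pools the counts before dividing:
    print(true_positives.sum() / predicted_positives.sum())  # ~0.173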

    Stateful metrics solve this issue by keeping intermediate values in variables that persist for the whole evaluation. So in the case of precision, a stateful metric keeps persistent counters for true_positives and predicted_positives and updates them batch by batch. TensorFlow metrics are stateful, e.g. tf.metrics.precision.
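
    A minimal sketch of how the stateful metric is used with the TF 1.x API (the batch data here is made up): tf.metrics.precision returns the running precision plus an update op that accumulates the counters across batches.

    import numpy as np
    import tensorflow as tf

    labels = tf.placeholder(tf.int64, shape=[None])
    predictions = tf.placeholder(tf.int64, shape=[None])

    # Creates local variables (the persistent counters) and returns the
    # running precision plus an op that updates those counters per batch.
    precision_value, update_op = tf.metrics.precision(labels, predictions)

    # Two hypothetical evaluation batches.
    batches = [
        (np.array([1, 1, 0, 0]), np.array([1, 0, 1, 0])),
        (np.array([1, 0, 1, 1]), np.array([1, 1, 1, 0])),
    ]

    with tf.Session() as sess:
        sess.run(tf.local_variables_initializer())  # initialise the metric's counters
        for batch_labels, batch_preds in batches:
            sess.run(update_op, feed_dict={labels: batch_labels,
                                           predictions: batch_preds})
        # One precision value accumulated over all batches:
        # 3 true positives / 5 predicted positives = 0.6
        print(sess.run(precision_value))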