Calculating Fscore for each epoch using keras (not batch-wise)

Essence of this question:

I'd like to find a proper way to calculate the Fscore for the validation and training data after each epoch (not batch-wise)

For a binary classification task, I'd like to calculate the Fscore after each epoch using a simple keras model. However, how best to calculate the Fscore seems to be a matter of some debate.

I know keras works in batches and one way to calculate the fscore for each batch would be (Fscore-calculation: f1).

The batch-wise calculation can be quite confusing, and I prefer to calculate the Fscore after each epoch. So just calling history.history['f1'] or history.history['val_f1'] does not do the trick, because it shows the batch-wise Fscores.
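To illustrate why an average of per-batch Fscores is not the same as the Fscore over the whole epoch, here is a minimal pure-Python sketch. The helper names (`counts`, `f1_from_counts`) and the toy labels are made up for illustration and are not part of the model above.

```python
def counts(y_true, y_pred):
    """Return (tp, fp, fn) for binary labels given as 0/1 sequences."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, fn


def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 if the denominator is zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy data (hypothetical): two batches, each given as (y_true, y_pred).
batches = [
    ([1, 1, 1, 0], [1, 1, 1, 1]),
    ([1, 0, 0, 0], [0, 1, 0, 0]),
]

# Batch-wise: compute F1 per batch, then average the scores.
batch_f1 = [f1_from_counts(*counts(t, p)) for t, p in batches]
mean_batch_f1 = sum(batch_f1) / len(batch_f1)

# Epoch-wise: pool the counts over all batches, then compute F1 once.
tp = sum(counts(t, p)[0] for t, p in batches)
fp = sum(counts(t, p)[1] for t, p in batches)
fn = sum(counts(t, p)[2] for t, p in batches)
epoch_f1 = f1_from_counts(tp, fp, fn)

print("per-batch F1:", batch_f1)
print("mean of per-batch F1:", mean_batch_f1)
print("F1 from pooled counts:", epoch_f1)
```

The two numbers differ because F1 is a ratio of counts, so averaging ratios over batches is not the same as taking the ratio of the pooled counts.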

I figured one way is to save each model using the ModelCheckpoint callback (from keras.callbacks import ModelCheckpoint):

  1. Saving the model weights after every epoch
  2. Reloading the model and using model.evaluate or model.predict


Using tensorflow backend, I decided to track TruePositives, FalsePositives and FalseNegatives (as umbreon29 suggested). But now comes the fun part: The results when reloading the model are different for the training data (TP, FP, FN are different) but not for the validation set!

So a simple model that stores the weights, so that each model can be rebuilt and the TP, FP, FN (and finally the Fscore) recalculated, looks like:

from keras.metrics import TruePositives, TrueNegatives, FalseNegatives, FalsePositives

## simple keras model
sequence_input = Input(shape=(input_dim,), dtype='float32')
preds = Dense(1, activation='sigmoid',name='output')(sequence_input)
model = Model(sequence_input, preds)


# model checkpoints
checkpoint = ModelCheckpoint(os.path.join(savemodel,filepath), monitor='val_f1', verbose=1, save_best_only=False, save_weights_only=True, mode='auto')
callbacks_list = [checkpoint]

history = model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=epoch, batch_size=batch,
                    callbacks=callbacks_list)

## Saving TP, FN, FP to calculate Fscore

arr_train = np.stack((tp, fp, fn), axis=1)

## doing the same for tp_val, fp_val, fn_val 
arr_val = np.stack((tp_val, fp_val, fn_val), axis=1)

## the following method just shows batch-wise fscores and shouldn't be used:
## f1_sc.append(history.history['f1'])  

Reloading the model after each epoch to calculate the Fscores (using model.predict with sklearn's f1_score, from sklearn.metrics import f1_score, is equivalent to calculating the Fscore from TP, FP and FN):

Fscore_val = []
fscorepredict_val_sklearn = []
Fscore_train = []
fscorepredict_train = []

## model_loads contains list of model-paths
for i in model_loads:
    ## rebuilding the model each time since only weights are stored
    sequence_input = Input(shape=(input_dim,), dtype='float32')
    preds = Dense(1, activation='sigmoid',name='output')(sequence_input)
    model = Model(sequence_input, preds)
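    model.load_weights(i)  # restore the weights saved for this epoch
    # Compile model with the same loss and metrics as for training (required for model.evaluate)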
    ### For Validation data
    ## using evaluate
    y_pred =  model.evaluate(x_val, y_val, verbose=0)
    Fscore_val.append(y_pred)  ## contains (loss,tp,fp,fn, f1-batchwise)
    ## using predict
    y_pred = model.predict(x_val)
    val_preds = [1 if x > 0.5 else 0 for x in y_pred]
    cm = f1_score(y_val, val_preds)
    fscorepredict_val_sklearn.append(cm)  ## equivalent to Fscore calculated from Fscore_vals tp,fp, fn

    ### For the training data
    y_pred =  model.evaluate(x_train, y_train, verbose=0) 
    Fscore_train.append(y_pred) ## also contains (loss,tp,fp,fn, f1-batchwise)
    y_pred =  model.predict(x_train, verbose=0)  # gives probabilities
    train_preds = [1 if x > 0.5 else 0 for x in y_pred]
    cm = f1_score(y_train, train_preds)
    fscorepredict_train.append(cm)

Calculating the Fscore from the tp, fp and fn in Fscore_val gives the same result as fscorepredict_val_sklearn, and both are identical to the Fscore calculated from arr_val.

However, the numbers of tp, fp and fn differ between Fscore_train and arr_train, so I also arrive at different Fscores. The numbers of tp, fp and fn should be the same, but they aren't. Is this a bug?
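One likely explanation, offered as a hedged sketch rather than a diagnosis of this exact setup: Keras accumulates training metrics batch by batch during the epoch, while the weights are still being updated, whereas validation metrics and a later model.evaluate score every sample with the weights as they are at the end of the epoch. The toy below mimics this with a one-parameter threshold "model" in pure Python. The data, the update rule and all names are invented for illustration and are not Keras code.

```python
def counts(threshold, batch):
    """Return (tp, fp, fn) for the rule 'predict 1 if x >= threshold'."""
    tp = fp = fn = 0
    for x, y in batch:
        pred = 1 if x >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
    return tp, fp, fn


# Toy training set (hypothetical), split into two batches of (x, label) pairs.
train_batches = [
    [(0.2, 0), (0.4, 0), (0.9, 1), (0.8, 1)],
    [(0.1, 0), (0.3, 0), (0.6, 1), (0.7, 1)],
]

threshold = 0.0       # initial "weights": everything is predicted as class 1
learning_rate = 0.25  # made-up step size for the toy update rule

# One training epoch: counts are accumulated batch by batch,
# while the threshold is updated after every batch.
running = [0, 0, 0]
for batch in train_batches:
    tp, fp, fn = counts(threshold, batch)
    running = [running[0] + tp, running[1] + fp, running[2] + fn]
    threshold += learning_rate * (fp - fn)  # toy update, not a real optimizer

# Re-evaluation after the epoch: every sample is scored with the final threshold.
final = [0, 0, 0]
for batch in train_batches:
    tp, fp, fn = counts(threshold, batch)
    final = [final[0] + tp, final[1] + fp, final[2] + fn]

print("accumulated during training (tp, fp, fn):", running)
print("re-evaluated after the epoch (tp, fp, fn):", final)
```

In this toy the accumulated counts still contain the false positives from the first batch, which the final threshold no longer produces. A validation set would not show this gap, since it is only scored once the epoch's updates are done.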

Which one should I trust? The fscorepredict_train values actually seem more trustworthy, since they start above the "always guessing class 1" Fscore (when recall=1). (fscorepredict_train[0]=0.6784 vs f_hist[0]=0.5736 vs always-guessing-class-1-fscore = 0.6751)

[Note: Fscore_train[0] = [0.6853608025386962, 2220.0, 250.0, 111.0, 1993.0, 0.6730511784553528] (loss,tp,tn,fp,fn) leading to fscore= 0.6784 , so Fscore from Fscore_train = fscorepredict_train ]
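As a quick check of the arithmetic in the note, here is a small sketch that derives precision, recall and F1 from the counts. It takes tp, fp and fn as labelled in the note (an assumption about the metric order), and `precision_recall_f1` is a hypothetical helper name.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from raw counts, guarding against division by zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts taken from Fscore_train[0] in the note, with labels as given there.
tp, fp, fn = 2220, 111, 1993
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(precision, recall, f1)
```

The resulting F1 is about 0.678, in line with the value quoted in the note. Since F1 = 2·TP / (2·TP + FP + FN), swapping fp and fn would not change it.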


  • I provide a custom callback that computes the score (in your case F1 from sklearn) on ALL the data at the end of each epoch (for the training data and, optionally, the validation data)

    import numpy as np
    import tensorflow as tf
    from sklearn.metrics import f1_score

    class F1History(tf.keras.callbacks.Callback):
        def __init__(self, train, validation=None):
            super(F1History, self).__init__()
            self.validation = validation
            self.train = train

        def on_epoch_end(self, epoch, logs=None):
            logs = logs if logs is not None else {}

            # F1 on the full training set, using the weights at the end of the epoch
            X_train, y_train = self.train[0], self.train[1]
            y_pred = (self.model.predict(X_train).ravel()>0.5)+0
            logs['F1_score_train'] = np.round(f1_score(y_train, y_pred), 5)

            # F1 on the full validation set, if one was given
            if self.validation:
                X_valid, y_valid = self.validation[0], self.validation[1]
                y_val_pred = (self.model.predict(X_valid).ravel()>0.5)+0
                logs['F1_score_val'] = np.round(f1_score(y_valid, y_val_pred), 5)

    Here is a dummy example:

    x_train = np.random.uniform(0,1, (30,10))
    y_train = np.random.randint(0,2, (30))
    x_val = np.random.uniform(0,1, (20,10))
    y_val = np.random.randint(0,2, (20))
    sequence_input = Input(shape=(10,), dtype='float32')
    preds = Dense(1, activation='sigmoid',name='output')(sequence_input)
    model = Model(sequence_input, preds)
    es = EarlyStopping(patience=3, verbose=1, min_delta=0.001, monitor='F1_score_val', mode='max', restore_best_weights=True)
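    model.compile(loss='binary_crossentropy', optimizer='adam')
    # F1History must come before es so that 'F1_score_val' is in the logs when EarlyStopping reads it
    model.fit(x_train,y_train, epochs=10,
              callbacks=[F1History(train=(x_train,y_train),validation=(x_val,y_val)),es])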

    The printed output:

    Epoch 1/10
    1/1 [==============================] - 0s 78ms/step - loss: 0.7453 - F1_score_train: 0.3478 - F1_score_val: 0.4762
    Epoch 2/10
    1/1 [==============================] - 0s 57ms/step - loss: 0.7448 - F1_score_train: 0.3478 - F1_score_val: 0.4762
    Epoch 3/10
    1/1 [==============================] - 0s 58ms/step - loss: 0.7444 - F1_score_train: 0.3478 - F1_score_val: 0.4762
    Epoch 4/10
    1/1 [==============================] - ETA: 0s - loss: 0.7439Restoring model weights from the end of the best epoch.
    1/1 [==============================] - 0s 70ms/step - loss: 0.7439 - F1_score_train: 0.3478 - F1_score_val: 0.4762

    I have TF 2.2 and it works without problems. I hope this helps.