PyTorch Lightning - Display metrics after validation epoch

I've implemented validation_epoch_end to produce and log metrics, and when I run trainer.validate, the metrics appear in my notebook.

However, when I run trainer.fit, only the training metrics appear; not the validation ones.

The validation step is still being run (because the validation code calls a print statement, which does appear), but the validation metrics don't appear, even though they're logged. Or, if they do appear, the next epoch immediately erases them, so that I can't see them.

(Likewise, TensorBoard sees the validation metrics.)

How can I see the validation epoch end metrics in a notebook, as each epoch occurs?

Solution

You could do the following. Let's say you have the following LightningModule:

import torch
from torch.nn import functional as F
from pytorch_lightning import LightningModule


class MNISTModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.l1 = torch.nn.Linear(28 * 28, 10)

    def forward(self, x):
        return torch.relu(self.l1(x.view(x.size(0), -1)))

    def training_step(self, batch, batch_nb):
        x, y = batch
        loss = F.cross_entropy(self(x), y)
        # prog_bar=True shows the value on the progress bar; with on_epoch=True it stays fixed at the result of the last completed training epoch
        self.log("train_loss", loss, on_step=False, on_epoch=True, prog_bar=True)

        return loss

    def validation_step(self, batch, batch_nb):
        x, y = batch
        loss = F.cross_entropy(self(x), y)
        # prog_bar=True shows the value on the progress bar; with on_epoch=True it stays fixed at the result of the last completed validation epoch
        self.log("val_loss", loss, on_step=False, on_epoch=True, prog_bar=True)

        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.02)
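
As a rough illustration of what on_step=False, on_epoch=True does in the self.log calls above: the per-batch values are accumulated and a single number is reported at the end of the epoch. To my understanding the default reduction is a mean weighted by batch size. The sketch below mimics that in plain Python; epoch_mean and the sample numbers are hypothetical and are not part of Lightning.

```python
def epoch_mean(step_values, batch_sizes):
    """Batch-size-weighted mean of per-step values."""
    total = sum(v * n for v, n in zip(step_values, batch_sizes))
    return total / sum(batch_sizes)


# Three hypothetical validation batches; the last one is smaller.
step_losses = [0.80, 0.70, 0.60]
batch_sizes = [64, 64, 32]

# One value per epoch, which is what the progress bar would display.
val_loss = epoch_mean(step_losses, batch_sizes)
print(f"val_loss={val_loss:.3f}")
```

This is why the logged value only changes once per epoch: nothing is reported until every batch has contributed.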

The trick is to pass prog_bar=True to self.log and to set on_step and on_epoch according to when you want the progress bar to update. So, in this case, when you train:

# Train the model ⚡
trainer.fit(mnist_model, MNIST_dm)

you will see:

Epoch 4: 100% -------------------------- 939/939 [00:09<00:00, 94.51it/s, loss=0.636, v_num=4, val_loss=0.743, train_loss=0.726]

Here, loss updates every batch because it is the step loss. By contrast, val_loss and train_loss are static values that change only after each validation or training epoch, respectively.