
PyTorch Lightning - Display metrics after validation epoch


I've implemented validation_epoch_end to produce and log metrics, and when I run trainer.validate, the metrics appear in my notebook.

However, when I run trainer.fit, only the training metrics appear; not the validation ones.

The validation step is still being run (because the validation code calls a print statement, which does appear), but the validation metrics don't appear, even though they're logged. Or, if they do appear, the next epoch immediately erases them, so that I can't see them.

(TensorBoard also sees the validation metrics.)

How can I see the validation epoch end metrics in a notebook, as each epoch occurs?


Solution

  • You could do the following. Let's say you have the following LightningModule:

    import torch
    from torch.nn import functional as F
    from pytorch_lightning import LightningModule


    class MNISTModel(LightningModule):
        def __init__(self):
            super().__init__()
            self.l1 = torch.nn.Linear(28 * 28, 10)

        def forward(self, x):
            # Flatten each image to a vector before the linear layer
            return torch.relu(self.l1(x.view(x.size(0), -1)))

        def training_step(self, batch, batch_nb):
            x, y = batch
            loss = F.cross_entropy(self(x), y)
            # prog_bar=True will display the value on the progress bar statically for the last complete train epoch
            self.log("train_loss", loss, on_step=False, on_epoch=True, prog_bar=True)

            return loss

        def validation_step(self, batch, batch_nb):
            x, y = batch
            loss = F.cross_entropy(self(x), y)
            # prog_bar=True will display the value on the progress bar statically for the last complete validation epoch
            self.log("val_loss", loss, on_step=False, on_epoch=True, prog_bar=True)

            return loss

        def configure_optimizers(self):
            return torch.optim.Adam(self.parameters(), lr=0.02)
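
    For intuition about what on_step=False, on_epoch=True does in the two self.log calls above: the per-batch values are not shown individually but are reduced to a single number per epoch. The default reduction is the mean, which, as far as I know, is weighted by batch size in recent Lightning versions. The plain-Python sketch below only imitates that behaviour and is not Lightning's implementation; epoch_mean, step_losses and batch_sizes are made-up names and numbers for illustration.

```python
def epoch_mean(step_values, batch_sizes):
    """Batch-size-weighted mean of per-step values (assumed reduction)."""
    total = sum(value * size for value, size in zip(step_values, batch_sizes))
    return total / sum(batch_sizes)


# Hypothetical per-batch losses from one validation epoch
step_losses = [0.9, 0.7, 0.5]
batch_sizes = [64, 64, 32]  # the last batch is smaller

# on_step=True would surface each of these values as it is produced
for step, loss in enumerate(step_losses):
    print(f"step {step}: loss={loss:.3f}")

# on_epoch=True surfaces one reduced value once the epoch is over
val_loss_epoch = epoch_mean(step_losses, batch_sizes)
print(f"epoch end: val_loss={val_loss_epoch:.3f}")
```

    This is why the progress-bar entry for val_loss stays fixed during training batches and only changes when a validation epoch finishes.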
    

    The trick is to pass prog_bar=True together with on_step and on_epoch, chosen according to when you want the value on the progress bar to update. So, in this case, when training:

    # Train the model ⚡
    trainer.fit(mnist_model, MNIST_dm)
    

    you will see:

    Epoch 4: 100% -------------------------- 939/939 [00:09<00:00, 94.51it/s, loss=0.636, v_num=4, val_loss=0.743, train_loss=0.726]
    

    Here, loss updates every batch because it is the step loss. In contrast, val_loss and train_loss are static values that change only after each validation or training epoch, respectively.
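
    If you would rather keep one printed line per epoch in the notebook (so earlier epochs are not overwritten), another option is a small callback that prints the logged validation metrics when each validation epoch ends. This is only a sketch under assumptions: PrintValMetrics and format_metrics are hypothetical names, the "val_" prefix is a naming convention I am assuming, and the import fallback exists only so the snippet can be read and run without Lightning installed. It relies on the Callback.on_validation_epoch_end hook and on trainer.callback_metrics; exactly when epoch-level values become visible there may vary between Lightning versions.

```python
try:
    from pytorch_lightning import Callback
except ImportError:  # fallback so the sketch runs without Lightning installed
    Callback = object


def format_metrics(metrics, prefix="val_"):
    """Format metrics whose names start with `prefix` as 'name=value' pairs."""
    picked = {name: float(value) for name, value in metrics.items()
              if name.startswith(prefix)}
    return ", ".join(f"{name}={value:.3f}" for name, value in sorted(picked.items()))


class PrintValMetrics(Callback):
    """Hypothetical callback: print validation metrics once per validation epoch."""

    def on_validation_epoch_end(self, trainer, pl_module):
        # Skip the short sanity-check validation run that precedes training
        if trainer.sanity_checking:
            return
        summary = format_metrics(trainer.callback_metrics)
        print(f"epoch {trainer.current_epoch}: {summary}")
```

    You would pass it when constructing the trainer, e.g. Trainer(callbacks=[PrintValMetrics()]), and then call trainer.fit as before. Each epoch adds a new printed line instead of replacing the previous one.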