Tags: python, logging, pytorch-lightning

How do you make the reported step a multiple of the logging frequency in PyTorch-Lightning, not the logging frequency minus 1?


[Warning!! pedantry inside]

I'm using PyTorch Lightning to wrap my PyTorch model, but because I'm pedantic, I find it frustrating that the logger reports steps at the frequency I've asked for, minus 1:

  1. When I set log_every_n_steps=100 in Trainer, my TensorBoard output shows my metrics at steps 99, 199, 299, etc. Why not at 100, 200, 300?
  2. When I set check_val_every_n_epoch=30 in Trainer, the progress bar in my console output goes up to epoch 29, then runs validation, leaving a trail of console output that reports metrics after epochs 29, 59, 89, etc. (a minimal setup that reproduces this is sketched after the output below). Like this:
Epoch 29: 100%|█████████████████████████████| 449/449 [00:26<00:00, 17.01it/s, loss=0.642, v_num=logs]
[validation] {'roc_auc': 0.663, 'bacc': 0.662, 'f1': 0.568, 'loss': 0.633}
Epoch 59: 100%|█████████████████████████████| 449/449 [00:26<00:00, 16.94it/s, loss=0.626, v_num=logs]
[validation] {'roc_auc': 0.665, 'bacc': 0.652, 'f1': 0.548, 'loss': 0.630}
Epoch 89: 100%|█████████████████████████████| 449/449 [00:27<00:00, 16.29it/s, loss=0.624, v_num=logs]
[validation] {'roc_auc': 0.665, 'bacc': 0.652, 'f1': 0.548, 'loss': 0.627}
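
For reference, here is a minimal, self-contained setup that shows the same behaviour; the ToyModule and the random tensors below are stand-ins for my actual model and data:

    import torch
    from torch import nn
    from torch.utils.data import DataLoader, TensorDataset
    import pytorch_lightning as pl

    class ToyModule(pl.LightningModule):
        """Tiny stand-in model, only here to reproduce the logging cadence."""
        def __init__(self):
            super().__init__()
            self.layer = nn.Linear(8, 1)

        def training_step(self, batch, batch_idx):
            x, y = batch
            loss = nn.functional.mse_loss(self.layer(x), y)
            self.log("train_loss", loss)  # written every log_every_n_steps
            return loss

        def validation_step(self, batch, batch_idx):
            x, y = batch
            self.log("val_loss", nn.functional.mse_loss(self.layer(x), y))

        def configure_optimizers(self):
            return torch.optim.SGD(self.parameters(), lr=0.01)

    data = TensorDataset(torch.randn(512, 8), torch.randn(512, 1))
    train_loader = DataLoader(data, batch_size=4)   # 128 steps per epoch
    val_loader = DataLoader(data, batch_size=4)

    trainer = pl.Trainer(
        max_epochs=90,
        log_every_n_steps=100,       # TensorBoard points land at steps 99, 199, 299, ...
        check_val_every_n_epoch=30,  # validation runs after epochs 29, 59, 89, ...
    )
    trainer.fit(ToyModule(), train_loader, val_loader)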

Am I doing something wrong? Should I simply submit a PR to PL to fix this?


Solution

  • You are not doing anything wrong. Python uses zero-based indexing, so epoch counting starts at zero as well. If you want to change what is displayed, you will need to override the default TQDMProgressBar and modify on_train_epoch_start to display an offset value. You can achieve this by:

    from pytorch_lightning.callbacks.progress.tqdm_progress import TQDMProgressBar, convert_inf

    class LitProgressBar(TQDMProgressBar):
        def init_validation_tqdm(self):
            bar = super().init_validation_tqdm()
            bar.set_description("running validation...")
            return bar

        def on_train_epoch_start(self, trainer, *_) -> None:
            total_train_batches = self.total_train_batches
            total_val_batches = self.total_val_batches
            if total_train_batches != float("inf") and total_val_batches != float("inf"):
                # val can be checked multiple times per epoch
                val_checks_per_epoch = total_train_batches // trainer.val_check_batch
                total_val_batches = total_val_batches * val_checks_per_epoch
            total_batches = total_train_batches + total_val_batches
            self.main_progress_bar.reset(convert_inf(total_batches))
            # display the epoch 1-based instead of the default 0-based
            self.main_progress_bar.set_description(f"Epoch {trainer.current_epoch + 1}")
    

    Notice the +1 in the last line of code. This will offset the epoch displayed in the progress bar. Then pass your custom bar to your trainer:

    import torch
    from pytorch_lightning import Trainer

    # Initialize a trainer
    trainer = Trainer(
        accelerator="auto",
        devices=1 if torch.cuda.is_available() else None,  # limit to 1 device for iPython runs
        max_epochs=3,
        callbacks=[LitProgressBar()],
        log_every_n_steps=100
    )
    

    Finally:

    trainer.fit(mnist_model, train_loader)
    

    For the first epoch this will display:

    GPU available: False, used: False
    TPU available: False, using: 0 TPU cores
    IPU available: False, using: 0 IPUs
    HPU available: False, using: 0 HPUs
    
      | Name | Type   | Params
    --------------------------------
    0 | l1   | Linear | 7.9 K 
    --------------------------------
    7.9 K     Trainable params
    0         Non-trainable params
    7.9 K     Total params
    0.031     Total estimated model params size (MB)
    
    Epoch 1: 17%                        160/938 [00:02<00:11, 68.93it/s, loss=1.05, v_num=4]
    

    and not the default

    Epoch 0: 17%                        160/938 [00:02<00:11, 68.93it/s, loss=1.05, v_num=4]
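
    The same zero-based counting is what puts the TensorBoard points at steps 99, 199, 299, ... (point 1 of the question). A minimal sketch of one way to shift them, assuming the default TensorBoardLogger and gating on a one-based step yourself (the class, metric name, and modulo gate below are illustrative choices, not a built-in option):

    import torch
    from torch import nn
    import torch.nn.functional as F
    import pytorch_lightning as pl

    class OneBasedLoggingModule(pl.LightningModule):
        """Sketch: write TensorBoard points at steps 100, 200, ... instead of 99, 199, ..."""
        def __init__(self, log_every_n: int = 100):
            super().__init__()
            self.layer = nn.Linear(8, 1)
            self.log_every_n = log_every_n

        def training_step(self, batch, batch_idx):
            x, y = batch
            loss = F.mse_loss(self.layer(x), y)
            step = self.global_step + 1  # one-based step counter
            if step % self.log_every_n == 0:
                # TensorBoardLogger.experiment is the underlying SummaryWriter,
                # so the step can be passed explicitly instead of relying on self.log
                self.logger.experiment.add_scalar("train_loss", loss.item(), global_step=step)
            return loss

        def configure_optimizers(self):
            return torch.optim.SGD(self.parameters(), lr=0.01)

    With that, the custom progress bar above handles the epoch display and the explicit step handles the TensorBoard axis.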