[Warning!! pedantry inside]
I'm using PyTorch Lightning to wrap my PyTorch model, but because I'm pedantic, I find the logger frustrating in the way it reports steps at the frequency I've asked for, minus 1:

- With log_every_n_steps=100 in Trainer, my TensorBoard output shows my metrics at steps 99, 199, 299, etc. Why not at 100, 200, 300?
- With check_val_every_n_epoch=30 in Trainer, the console progress bar goes up to epoch 29 and then runs validation, leaving a trail of console outputs that report metrics after epochs 29, 59, 89, etc. (my Trainer setup is sketched after the log excerpt below). Like this:

Epoch 29: 100%|█████████████████████████████| 449/449 [00:26<00:00, 17.01it/s, loss=0.642, v_num=logs]
[validation] {'roc_auc': 0.663, 'bacc': 0.662, 'f1': 0.568, 'loss': 0.633}
Epoch 59: 100%|█████████████████████████████| 449/449 [00:26<00:00, 16.94it/s, loss=0.626, v_num=logs]
[validation] {'roc_auc': 0.665, 'bacc': 0.652, 'f1': 0.548, 'loss': 0.630}
Epoch 89: 100%|█████████████████████████████| 449/449 [00:27<00:00, 16.29it/s, loss=0.624, v_num=logs]
[validation] {'roc_auc': 0.665, 'bacc': 0.652, 'f1': 0.548, 'loss': 0.627}
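
For reference, a Trainer configured roughly like this reproduces the behaviour (model and dataloader names are placeholders, and max_epochs is just illustrative):

from pytorch_lightning import Trainer

trainer = Trainer(
    max_epochs=90,                # illustrative
    log_every_n_steps=100,        # I expected TensorBoard points at steps 100, 200, 300, ...
    check_val_every_n_epoch=30,   # I expected validation after epochs 30, 60, 90, ...
)
trainer.fit(model, train_loader, val_loader)  # placeholder names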
Am I doing something wrong? Should I simply submit a PR to PL to fix this?
You are not doing anything wrong. Python uses zero-based indexing, so Lightning's epoch and step counters start at zero as well. If you want to change what is displayed, you need to override the default TQDMProgressBar and modify on_train_epoch_start to display an offset value. You can achieve this as follows:
from pytorch_lightning.callbacks import TQDMProgressBar
from pytorch_lightning.callbacks.progress.tqdm_progress import convert_inf


class LitProgressBar(TQDMProgressBar):
    def init_validation_tqdm(self):
        bar = super().init_validation_tqdm()
        bar.set_description("running validation...")
        return bar

    def on_train_epoch_start(self, trainer, *_) -> None:
        total_train_batches = self.total_train_batches
        total_val_batches = self.total_val_batches
        if total_train_batches != float("inf") and total_val_batches != float("inf"):
            # val can be checked multiple times per epoch
            val_checks_per_epoch = total_train_batches // trainer.val_check_batch
            total_val_batches = total_val_batches * val_checks_per_epoch
        total_batches = total_train_batches + total_val_batches
        self.main_progress_bar.reset(convert_inf(total_batches))
        # +1 so the bar shows "Epoch 1" for the first (zero-indexed) epoch
        self.main_progress_bar.set_description(f"Epoch {trainer.current_epoch + 1}")
Notice the +1 in the last line of code. This will offset the epoch displayed in the progress bar. Then pass your custom bar to your trainer:
import torch
from pytorch_lightning import Trainer

# Initialize a trainer
trainer = Trainer(
    accelerator="auto",
    devices=1 if torch.cuda.is_available() else None,  # limit to 1 device for iPython runs
    max_epochs=3,
    callbacks=[LitProgressBar()],
    log_every_n_steps=100,
)
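
The call below assumes mnist_model and train_loader already exist. A minimal sketch of what they might look like (a single-linear-layer MNIST classifier, consistent with the 7.9 K-parameter summary shown further down) is:

import torch
from torch import nn
from torch.nn import functional as F
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import MNIST
import pytorch_lightning as pl


class MNISTModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # one linear layer: 28*28*10 + 10 = 7,850 parameters (~7.9 K)
        self.l1 = nn.Linear(28 * 28, 10)

    def forward(self, x):
        return self.l1(x.view(x.size(0), -1))

    def training_step(self, batch, batch_idx):
        x, y = batch
        return F.cross_entropy(self(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


mnist_model = MNISTModel()
train_ds = MNIST(".", train=True, download=True, transform=transforms.ToTensor())
train_loader = DataLoader(train_ds, batch_size=64)  # 60,000 samples / 64 ≈ 938 batches per epoch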
Finally:
trainer.fit(mnist_model, train_loader)
For the first epoch this will display:
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
| Name | Type | Params
--------------------------------
0 | l1 | Linear | 7.9 K
--------------------------------
7.9 K Trainable params
0 Non-trainable params
7.9 K Total params
0.031 Total estimated model params size (MB)
Epoch 1: 17% 160/938 [00:02<00:11, 68.93it/s, loss=1.05, v_num=4]
and not the default:
Epoch 0: 17% 160/938 [00:02<00:11, 68.93it/s, loss=1.05, v_num=4]