Tags: pytorch, pytorch-lightning

ValueError in PyTorch Lightning when training


While training an NLP model, the following error occurred at epoch 50:

  File "train/mrc_ner_trainer.py", line 431, in <module>
    if __name__ == '__main__':
  File "train/mrc_ner_trainer.py", line 418, in main
    
  File "/home/pkusam/anaconda3/envs/mrcNer3_7/lib/python3.7/site-packages/pytorch_lightning/trainer/states.py", line 48, in wrapped_fn
    result = fn(self, *args, **kwargs)
  File "/home/pkusam/anaconda3/envs/mrcNer3_7/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1073, in fit
    results = self.accelerator_backend.train(model)
  File "/home/pkusam/anaconda3/envs/mrcNer3_7/lib/python3.7/site-packages/pytorch_lightning/accelerators/gpu_backend.py", line 51, in train
    results = self.trainer.run_pretrain_routine(model)
  File "/home/pkusam/anaconda3/envs/mrcNer3_7/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1239, in run_pretrain_routine
    self.train()
  File "/home/pkusam/anaconda3/envs/mrcNer3_7/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 394, in train
    self.run_training_epoch()
  File "/home/pkusam/anaconda3/envs/mrcNer3_7/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 531, in run_training_epoch
    self.update_train_loop_lr_schedulers(monitor_metrics=monitor_metrics)
  File "/home/pkusam/anaconda3/envs/mrcNer3_7/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 599, in update_train_loop_lr_schedulers
    self.update_learning_rates(interval='step', monitor_metrics=monitor_metrics)
  File "/home/pkusam/anaconda3/envs/mrcNer3_7/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 1306, in update_learning_rates
    lr_scheduler['scheduler'].step()
  File "/home/pkusam/anaconda3/envs/mrcNer3_7/lib/python3.7/site-packages/torch/optim/lr_scheduler.py", line 154, in step
    values = self.get_lr()
  File "/home/pkusam/anaconda3/envs/mrcNer3_7/lib/python3.7/site-packages/torch/optim/lr_scheduler.py", line 1248, in get_lr
    .format(step_num + 1, self.total_steps))
ValueError: Tried to step 192652 times. The specified number of total steps is 192650

It is difficult to explain why this happens. I had successfully trained this model for 45 epochs with the same code and saved a checkpoint. Then I wanted to continue training this model for 55 epochs, and nothing unusual happened in epochs 46-49.
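
For reference, the ValueError itself comes from OneCycleLR: the scheduler raises as soon as .step() has been called more times than its total_steps allows. A minimal standalone sketch (not the question's code) that reproduces this class of error:

    import torch

    # minimal reproduction: step an OneCycleLR past its total_steps
    model = torch.nn.Linear(2, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=0.1, total_steps=3)

    for _ in range(4):
        optimizer.step()
        scheduler.step()
    # the 4th scheduler.step() raises:
    # ValueError: Tried to step 5 times. The specified number of total steps is 3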

Update:

The calculation of t_total for the OneCycleLR scheduler in my code is:

    t_total = (len(self.train_dataloader()) // (self.args.accumulate_grad_batches * num_gpus) + 1) * self.args.max_epochs
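
To make that formula concrete, here is the same computation with hypothetical values (the question does not give the real ones):

    # hypothetical values, only to illustrate the formula above
    batches_per_epoch = 3853        # len(self.train_dataloader())
    accumulate_grad_batches = 2
    num_gpus = 1
    max_epochs = 100

    t_total = (batches_per_epoch // (accumulate_grad_batches * num_gpus) + 1) * max_epochs
    # (3853 // 2 + 1) * 100 = 1927 * 100 = 192700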

Everything is fine when I train a new model from scratch, but it fails when I train from an existing checkpoint.

Here is the checkpoint-loading code:

    model = BertLabeling(args)
    if args.pretrained_checkpoint:
        model.load_state_dict(torch.load(args.pretrained_checkpoint,
                                         map_location=torch.device('cpu'))["state_dict"])

    checkpoint_callback = ModelCheckpoint(
        filepath=args.default_root_dir,
        save_top_k=args.max_keep_ckpt,
        verbose=True,
        monitor="span_f1",
        period=-1,
        mode="max",
    )
    trainer.fit(model)
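
One thing worth noting (an observation about Lightning's checkpoint format, not something stated in the question): the code above restores only the "state_dict" entry, i.e. the model weights. A Lightning checkpoint file typically also carries optimizer and scheduler state, which is restored only when the Trainer resumes from the full checkpoint. A quick way to inspect what the file contains:

    import torch

    # sketch: inspect which entries the Lightning checkpoint carries
    ckpt = torch.load(args.pretrained_checkpoint, map_location="cpu")
    print(ckpt.keys())
    # typically: 'epoch', 'global_step', 'state_dict', 'optimizer_states', 'lr_schedulers', ...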

And here is the configure_optimizers function of BertLabeling:

    def configure_optimizers(self):
        t_total = (len(self.train_dataloader()) // (self.args.accumulate_grad_batches * num_gpus) + 1) * self.args.max_epochs
        if self.args.lr_scheduler == "onecycle":
            scheduler = torch.optim.lr_scheduler.OneCycleLR(
                optimizer, max_lr=self.args.lr,
                pct_start=float(self.args.warmup_steps / t_total),
                final_div_factor=self.args.final_div_factor,
                total_steps=t_total, anneal_strategy='linear'
            )
        # other scheduler.....
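
The traceback shows update_learning_rates(interval='step'), so the scheduler is stepped once per optimizer step. The return statement is elided above; under Lightning's conventions it presumably looks something like this (an assumed reconstruction, not the question's code):

        # assumed reconstruction: with 'interval': 'step', Lightning calls
        # scheduler.step() after every optimizer step, matching the traceback
        return [optimizer], [{"scheduler": scheduler, "interval": "step"}]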

Then, when loading the checkpoint, I hardcoded t_total = 192650 * 100, but it still raises ValueError: Tried to step 192652 times. The specified number of total steps is 192650.

My guess is that when I load the checkpoint, the existing optimizer configuration (current step, total steps, current lr, and so on) is loaded as well. If so, how do I reset the optimizer?


Solution

  • You can remove OneCycleLR and use trainer.auto_lr_find=True instead; then everything works fine. You can learn more in the PyTorch Lightning documentation.
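
As a sketch of that suggestion (the auto_lr_find Trainer flag is real; the surrounding setup here is assumed): return only the optimizer from configure_optimizers, drop OneCycleLR, and let Lightning's learning-rate finder pick the initial lr:

    import pytorch_lightning as pl

    # assumed setup: configure_optimizers returns only the optimizer, and the
    # model exposes a `lr` (or `learning_rate`) attribute for the finder to set
    trainer = pl.Trainer(gpus=1, max_epochs=100, auto_lr_find=True)
    # in the 0.9.x release from the traceback the finder runs at the start of
    # fit(); newer releases trigger it via trainer.tune(model)
    trainer.fit(model)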