While training an NLP model, I got the following error in epoch 50:
  File "train/mrc_ner_trainer.py", line 431, in <module>
    if __name__ == '__main__':
  File "train/mrc_ner_trainer.py", line 418, in main
  File "/home/pkusam/anaconda3/envs/mrcNer3_7/lib/python3.7/site-packages/pytorch_lightning/trainer/states.py", line 48, in wrapped_fn
    result = fn(self, *args, **kwargs)
  File "/home/pkusam/anaconda3/envs/mrcNer3_7/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1073, in fit
    results = self.accelerator_backend.train(model)
  File "/home/pkusam/anaconda3/envs/mrcNer3_7/lib/python3.7/site-packages/pytorch_lightning/accelerators/gpu_backend.py", line 51, in train
    results = self.trainer.run_pretrain_routine(model)
  File "/home/pkusam/anaconda3/envs/mrcNer3_7/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1239, in run_pretrain_routine
    self.train()
  File "/home/pkusam/anaconda3/envs/mrcNer3_7/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 394, in train
    self.run_training_epoch()
  File "/home/pkusam/anaconda3/envs/mrcNer3_7/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 531, in run_training_epoch
    self.update_train_loop_lr_schedulers(monitor_metrics=monitor_metrics)
  File "/home/pkusam/anaconda3/envs/mrcNer3_7/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 599, in update_train_loop_lr_schedulers
    self.update_learning_rates(interval='step', monitor_metrics=monitor_metrics)
  File "/home/pkusam/anaconda3/envs/mrcNer3_7/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 1306, in update_learning_rates
    lr_scheduler['scheduler'].step()
  File "/home/pkusam/anaconda3/envs/mrcNer3_7/lib/python3.7/site-packages/torch/optim/lr_scheduler.py", line 154, in step
    values = self.get_lr()
  File "/home/pkusam/anaconda3/envs/mrcNer3_7/lib/python3.7/site-packages/torch/optim/lr_scheduler.py", line 1248, in get_lr
    .format(step_num + 1, self.total_steps))
ValueError: Tried to step 192652 times. The specified number of total steps is 192650
It is hard to explain why this happens. Using the same code, I had successfully trained this model for 45 epochs and saved the checkpoint. I then wanted to continue training it for 55 epochs. Nothing unusual happened in epochs 46-49.
Update:
In my code, t_total for the OneCycle scheduler is computed as:
t_total = (len(self.train_dataloader()) // (self.args.accumulate_grad_batches * num_gpus) + 1) * self.args.max_epochs
Everything is fine when I train a new model, but it fails when I train from an existing checkpoint.
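To make the arithmetic concrete, here is a minimal sketch of the t_total formula above in plain Python. The helper names `onecycle_total_steps` and `steps_left` are hypothetical, and the input values are assumptions picked only because they reproduce the total from the error message; the real dataloader length, accumulation factor, GPU count, and max_epochs are not given in the post.

```python
def onecycle_total_steps(num_batches, accumulate_grad_batches, num_gpus, max_epochs):
    """Return t_total using the same formula as in the post."""
    steps_per_epoch = num_batches // (accumulate_grad_batches * num_gpus) + 1
    return steps_per_epoch * max_epochs


def steps_left(t_total, scheduler_steps_taken):
    """How many more scheduler steps fit into the OneCycle budget."""
    return t_total - scheduler_steps_taken


# Hypothetical inputs, picked only because they reproduce the total in the error message.
t_total = onecycle_total_steps(
    num_batches=3852, accumulate_grad_batches=1, num_gpus=1, max_epochs=50
)
print(t_total)  # 192650

# A negative value means the scheduler is asked to step past its budget.
print(steps_left(t_total, scheduler_steps_taken=192651))  # -1
```

The point of the sketch is that t_total is a fixed budget: if the run performs more scheduler steps than this number, for example because it trains for more epochs than the max_epochs used in the formula, OneCycleLR has no schedule left to follow.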
Here is how I load the checkpoint:
model = BertLabeling(args)
if args.pretrained_checkpoint:
    # Restores only the model weights stored under the "state_dict" key
    model.load_state_dict(torch.load(args.pretrained_checkpoint,
                                     map_location=torch.device('cpu'))["state_dict"])
checkpoint_callback = ModelCheckpoint(
    filepath=args.default_root_dir,
    save_top_k=args.max_keep_ckpt,
    verbose=True,
    monitor="span_f1",
    period=-1,
    mode="max",
)
trainer.fit(model)
And here is the configure_optimizers function in BertLabeling:
def configure_optimizers(self):
    t_total = (len(self.train_dataloader()) // (self.args.accumulate_grad_batches * num_gpus) + 1) * self.args.max_epochs
    if self.args.lr_scheduler == "onecycle":
        scheduler = torch.optim.lr_scheduler.OneCycleLR(
            optimizer, max_lr=self.args.lr,
            pct_start=float(self.args.warmup_steps/t_total),
            final_div_factor=self.args.final_div_factor,
            total_steps=t_total, anneal_strategy='linear'
        )
    # other scheduler.....
Then, when loading the checkpoint, I hardcoded t_total=192650 * 100, but it still says ValueError: Tried to step 192652 times. The specified number of total steps is 192650
My guess is that when I load the checkpoint, I also load the existing optimizer configuration (current step, total steps, current learning rate, and so on). If so, how can I reset the optimizer?
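The guess above can be checked in isolation with plain PyTorch. The following is a minimal sketch, not the code from the post: the tiny Linear model, the SGD optimizer, the helper `make_scheduler`, and the step counts are all assumptions made for illustration. It shows that a OneCycleLR state dict carries total_steps and the current step, that loading it into a new scheduler overrides the total_steps passed to the constructor, and that stepping past the budget raises the ValueError. Whether the scheduler state is actually restored in the setup above depends on how the checkpoint is loaded; the loading snippet in the post restores only the model weights.

```python
import torch
from torch.optim.lr_scheduler import OneCycleLR


def make_scheduler(total_steps):
    # Tiny stand-in model and optimizer, for illustration only.
    model = torch.nn.Linear(2, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    scheduler = OneCycleLR(
        optimizer, max_lr=0.1, total_steps=total_steps,
        anneal_strategy="linear", cycle_momentum=False,
    )
    return optimizer, scheduler


# "First run": a schedule with a budget of 10 steps, stopped after 5 of them.
old_optimizer, old_scheduler = make_scheduler(total_steps=10)
for _ in range(5):
    old_optimizer.step()
    old_scheduler.step()
saved_state = old_scheduler.state_dict()

# "Second run": a fresh scheduler with a much larger budget.
new_optimizer, new_scheduler = make_scheduler(total_steps=1000)
fresh_total_before = new_scheduler.total_steps  # 1000, as passed to the constructor

# Restoring the saved state replaces the new budget with the old one.
new_scheduler.load_state_dict(saved_state)
print(fresh_total_before, new_scheduler.total_steps)

# Stepping beyond the restored budget raises the error from the traceback.
raised = False
try:
    for _ in range(20):
        new_optimizer.step()
        new_scheduler.step()
except ValueError as err:
    raised = True
    print(err)
```

Under these assumptions, resetting the schedule amounts to building a fresh scheduler and not loading the saved scheduler state into it. The exact wording of the error message may differ between PyTorch versions.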
You can remove OneCycleLR and use trainer.auto_lr_find=True; then everything works fine. You can learn more here.