Tags: deep-learning, pytorch, huggingface-transformers, language-model, gpt-2

Modifying the Learning Rate in the Middle of Model Training in Deep Learning


Below is the code that configures the TrainingArguments from the Hugging Face transformers library to fine-tune the GPT-2 language model.

from transformers import TrainingArguments, Trainer, EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir="./gpt2-language-model",   # the output directory
    num_train_epochs=100,                 # number of training epochs
    per_device_train_batch_size=8,        # batch size for training
    per_device_eval_batch_size=8,         # batch size for evaluation
    save_steps=100,                       # a checkpoint is saved every 100 steps
    warmup_steps=500,                     # number of warmup steps for the learning rate scheduler
    prediction_loss_only=True,
    metric_for_best_model="eval_loss",
    load_best_model_at_end=True,
    evaluation_strategy="epoch",
    learning_rate=0.00004,                # learning rate
)

early_stop_callback = EarlyStoppingCallback(early_stopping_patience=3)

trainer = Trainer(
    model=gpt2_model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    callbacks=[early_stop_callback],
)

The number of epochs is set to 100, the learning_rate to 0.00004, and early stopping is configured with a patience of 3.

The model has run for 5 of the 100 epochs, and I noticed that the change in the loss value is negligible. The latest checkpoint is saved as checkpoint-latest.

Can I now change the learning_rate, say from 0.00004 to 0.01, and resume training from the latest saved checkpoint, checkpoint-latest? Would doing that be efficient?

Or, to train with the new learning_rate value, should I restart training from the beginning?


Solution

  • No, you don't have to restart your training.

    Changing the learning rate is like changing how big a step your model takes in the direction determined by your loss function.

    You can also think of it as transfer learning, where the model already has some experience (no matter how little or irrelevant) and its weights are most likely in a better state than randomly initialised ones.

    As a matter of fact, changing the learning rate mid-training is considered an art in deep learning, and you should only do it if you have a very good reason to.

    You would probably want to write down when (and why, what, etc.) you did it, in case you or someone else wants to reproduce the result of your model. A sketch of resuming from checkpoint-latest with the new learning rate is shown below.
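
    Here is a minimal sketch of one way to do this, under a few assumptions: the checkpoint directory is ./gpt2-language-model/checkpoint-latest, the model is a GPT2LMHeadModel, and data_collator, train_dataset and test_dataset are still defined as in the question. One caveat: calling trainer.train(resume_from_checkpoint=...) also restores the optimizer and learning-rate-scheduler state stored in the checkpoint, which would largely override the new learning_rate, so this sketch instead loads only the model weights and builds a fresh Trainer with the new TrainingArguments.

        from transformers import (GPT2LMHeadModel, Trainer, TrainingArguments,
                                  EarlyStoppingCallback)

        # Load only the model weights from the last checkpoint
        # (checkpoint path assumed; use the directory your checkpoint was saved to).
        gpt2_model = GPT2LMHeadModel.from_pretrained("./gpt2-language-model/checkpoint-latest")

        # Same arguments as before, but with the new learning rate.
        training_args = TrainingArguments(
            output_dir="./gpt2-language-model",
            num_train_epochs=100,
            per_device_train_batch_size=8,
            per_device_eval_batch_size=8,
            save_steps=100,
            warmup_steps=500,
            prediction_loss_only=True,
            metric_for_best_model="eval_loss",
            load_best_model_at_end=True,
            evaluation_strategy="epoch",
            learning_rate=0.01,   # the new learning rate
        )

        trainer = Trainer(
            model=gpt2_model,
            args=training_args,
            data_collator=data_collator,
            train_dataset=train_dataset,
            eval_dataset=test_dataset,
            callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
        )

        trainer.train()   # optimisation starts from the checkpoint's weights

    Whether 0.01 is a sensible value is a separate question: a jump of almost three orders of magnitude can easily make training diverge, so a more conservative increase (or a short learning-rate sweep) is usually the safer choice.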