Tags: machine-learning, deep-learning, huggingface-transformers, huggingface-trainer, learning-rate

How to fix the learning rate for Hugging Face's Trainer?


I'm training a model with the following parameters:

Seq2SeqTrainingArguments(
    output_dir                   = "./out", 
    overwrite_output_dir         = True,
    do_train                     = True,
    do_eval                      = True,
    
    per_device_train_batch_size  = 2, 
    gradient_accumulation_steps  = 4,
    per_device_eval_batch_size   = 8, 
    
    learning_rate                = 1.25e-5,
    warmup_steps                 = 1,
    
    save_total_limit             = 1,
       
    evaluation_strategy          = "epoch",
    save_strategy                = "epoch",
    logging_strategy             = "epoch",  
    num_train_epochs             = 5,   
    
    gradient_checkpointing       = True,
    fp16                         = True,    
        
    predict_with_generate        = True,
    generation_max_length        = 225,
          
    report_to                    = ["tensorboard"],
    load_best_model_at_end       = True,
    metric_for_best_model        = "wer",
    greater_is_better            = False,
    push_to_hub                  = False,
)

I assume that warmup_steps=1 fixes the learning rate. However, after training finishes and I look at the file trainer_state.json, it seems that the learning rate is not fixed.

Here are the values of learning_rate and step:

learning_rate    step

1.0006e-05       1033
7.5062e-06       2066
5.0058e-06       3099
2.5053e-06       4132
7.2618e-09       5165

It seems that the learning rate is not fixed at 1.25e-5 (after step 1). What am I missing? How do I fix the learning rate?


Solution

  • A warm-up is, in general, an increase of the learning rate: it starts at 0 and rises linearly over the warmup steps (here, a single step) up to the specified learning rate of 1.25e-5.

    Afterwards, by default, a linear learning-rate scheduler (in other cases a cosine one) decays the learning rate towards 0 over the remaining training steps; that is exactly the decay you see in trainer_state.json, and the sketch after the scheduler list below reproduces it.

    To disable the decay, set lr_scheduler_type='constant'. If I recall correctly, this also disables the warmup.
    If you want a warmup followed by a constant rate, use 'constant_with_warmup' instead (see the configuration sketch at the end of this answer).

    EDIT: Valid scheduler types are defined in trainer_utils.py, in the class SchedulerType:

    class SchedulerType(ExplicitEnum):
        """
        Scheduler names for the parameter `lr_scheduler_type` in [`TrainingArguments`].
        By default, it uses "linear". Internally, this retrieves `get_linear_schedule_with_warmup` scheduler from [`Trainer`].
        Scheduler types:
           - "linear" = get_linear_schedule_with_warmup
           - "cosine" = get_cosine_schedule_with_warmup
           - "cosine_with_restarts" = get_cosine_with_hard_restarts_schedule_with_warmup
           - "polynomial" = get_polynomial_decay_schedule_with_warmup
           - "constant" =  get_constant_schedule
           - "constant_with_warmup" = get_constant_schedule_with_warmup
           - "inverse_sqrt" = get_inverse_sqrt_schedule
           - "reduce_lr_on_plateau" = get_reduce_on_plateau_schedule
           - "cosine_with_min_lr" = get_cosine_with_min_lr_schedule_with_warmup
           - "warmup_stable_decay" = get_wsd_schedule
        """
    
        LINEAR = "linear"
        COSINE = "cosine"
        COSINE_WITH_RESTARTS = "cosine_with_restarts"
        POLYNOMIAL = "polynomial"
        CONSTANT = "constant"
        CONSTANT_WITH_WARMUP = "constant_with_warmup"
        INVERSE_SQRT = "inverse_sqrt"
        REDUCE_ON_PLATEAU = "reduce_lr_on_plateau"
        COSINE_WITH_MIN_LR = "cosine_with_min_lr"
        WARMUP_STABLE_DECAY = "warmup_stable_decay"
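
    For reference, here is a minimal sketch, outside the Trainer, showing that the default "linear" schedule with warmup_steps=1 produces roughly the decay you logged. The step counts (5165 total, logging every 1033 steps) are only inferred from the question's trainer_state.json, not taken from Trainer internals:

    import torch
    from transformers import get_linear_schedule_with_warmup

    # Dummy one-parameter optimizer, used only to drive the scheduler.
    param = torch.nn.Parameter(torch.zeros(1))
    optimizer = torch.optim.AdamW([param], lr=1.25e-5)

    # warmup_steps=1 and roughly 5165 total optimizer steps, as in the question's run.
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=1,
        num_training_steps=5165,
    )

    for step in range(1, 5166):
        optimizer.step()
        scheduler.step()
        if step % 1033 == 0:
            # Prints values close to the logged ones: ~1.0e-05 at step 1033,
            # ~7.5e-06 at 2066, ... down to ~0 at 5165.
            print(step, scheduler.get_last_lr()[0])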
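
    With those names, the fix for the question's setup is a single extra argument. A hedged sketch, repeating only the relevant arguments (everything else stays exactly as in the question):

    from transformers import Seq2SeqTrainingArguments

    training_args = Seq2SeqTrainingArguments(
        output_dir        = "./out",
        learning_rate     = 1.25e-5,
        warmup_steps      = 1,
        # Hold the learning rate at 1.25e-5 after the single warmup step;
        # use "constant" instead if you want no warmup at all.
        lr_scheduler_type = "constant_with_warmup",
        # ... all other arguments unchanged from the question ...
    )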