I'm training a model with the following parameters:
Seq2SeqTrainingArguments(
    output_dir="./out",
    overwrite_output_dir=True,
    do_train=True,
    do_eval=True,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    per_device_eval_batch_size=8,
    learning_rate=1.25e-5,
    warmup_steps=1,
    save_total_limit=1,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    logging_strategy="epoch",
    num_train_epochs=5,
    gradient_checkpointing=True,
    fp16=True,
    predict_with_generate=True,
    generation_max_length=225,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=False,
)
I assumed that warmup_steps=1 would fix the learning rate at a constant value. However, after training finished, I looked at the file trainer_state.json, and it seems that the learning rate is not fixed.
Here are the logged values of learning_rate and step:

learning_rate    step
1.0006e-05       1033
7.5062e-06       2066
5.0058e-06       3099
2.5053e-06       4132
7.2618e-09       5165
It seems that the learning rate is not fixed at 1.25e-5 (after step 1). What am I missing? How do I fix the learning rate?
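For reference, I read these values from the log_history list inside trainer_state.json. A minimal sketch of how to pull them out (the checkpoint directory name here is hypothetical; use the last one under output_dir):

import json

# the Trainer writes trainer_state.json into each checkpoint directory
with open("./out/checkpoint-5165/trainer_state.json") as f:  # hypothetical checkpoint name
    state = json.load(f)

# log_history holds one dict per logging event; training entries carry learning_rate
for entry in state["log_history"]:
    if "learning_rate" in entry:
        print(entry["learning_rate"], entry["step"])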
A warm-up is, in general, an increase of the learning rate: it starts at 0 and increases linearly over the warmup steps (here, a single step) up to the specified learning rate of 1.25e-5.
Afterwards, by default, a linear learning-rate scheduler (a cosine one in other configurations) decays your learning rate towards 0.
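This matches your logs: after the warmup, the linear scheduler scales the peak rate by (num_training_steps - step) / (num_training_steps - num_warmup_steps). A quick check, assuming 5165 total steps (your last logged step):

peak_lr, total_steps, warmup_steps = 1.25e-5, 5165, 1

def linear_lr(step):
    # linear decay after warmup, as in get_linear_schedule_with_warmup
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

print(linear_lr(1033))  # ~1.0e-05, approximately the logged 1.0006e-05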
To disable the decay, add lr_scheduler_type='constant'. If I recall correctly, this also disables the warmup. If you want a warmup followed by a constant rate, use lr_scheduler_type='constant_with_warmup' instead.
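Here is a minimal sketch of the fix, reusing only the relevant arguments from your question (everything not shown stays as it is):

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./out",
    learning_rate=1.25e-5,
    warmup_steps=1,                            # ramp from 0 to 1.25e-5 over one step
    lr_scheduler_type="constant_with_warmup",  # then hold 1.25e-5 instead of decaying
    num_train_epochs=5,
    # ... remaining arguments unchanged ...
)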
EDIT: Valid scheduler types are defined in trainer_utils.py, in the class SchedulerType:
class SchedulerType(ExplicitEnum):
    """
    Scheduler names for the parameter `lr_scheduler_type` in [`TrainingArguments`].
    By default, it uses "linear". Internally, this retrieves `get_linear_schedule_with_warmup` scheduler from [`Trainer`].
    Scheduler types:
       - "linear" = get_linear_schedule_with_warmup
       - "cosine" = get_cosine_schedule_with_warmup
       - "cosine_with_restarts" = get_cosine_with_hard_restarts_schedule_with_warmup
       - "polynomial" = get_polynomial_decay_schedule_with_warmup
       - "constant" = get_constant_schedule
       - "constant_with_warmup" = get_constant_schedule_with_warmup
       - "inverse_sqrt" = get_inverse_sqrt_schedule
       - "reduce_lr_on_plateau" = get_reduce_on_plateau_schedule
       - "cosine_with_min_lr" = get_cosine_with_min_lr_schedule_with_warmup
       - "warmup_stable_decay" = get_wsd_schedule
    """

    LINEAR = "linear"
    COSINE = "cosine"
    COSINE_WITH_RESTARTS = "cosine_with_restarts"
    POLYNOMIAL = "polynomial"
    CONSTANT = "constant"
    CONSTANT_WITH_WARMUP = "constant_with_warmup"
    INVERSE_SQRT = "inverse_sqrt"
    REDUCE_ON_PLATEAU = "reduce_lr_on_plateau"
    COSINE_WITH_MIN_LR = "cosine_with_min_lr"
    WARMUP_STABLE_DECAY = "warmup_stable_decay"
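To sanity-check a scheduler choice without a full training run, you can build it directly with get_scheduler and step a dummy optimizer. A small sketch, with the step count taken from the question's logs:

from transformers import get_scheduler
import torch

param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.AdamW([param], lr=1.25e-5)
scheduler = get_scheduler(
    "constant_with_warmup",
    optimizer,
    num_warmup_steps=1,
    num_training_steps=5165,  # ignored by the constant schedules
)

for step in range(3):
    optimizer.step()
    scheduler.step()
    print(step, scheduler.get_last_lr()[0])
# after the single warmup step, the learning rate holds at 1.25e-05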