Tags: python, pytorch, huggingface-transformers, language-model, gpt-2

Huggingface Transformer - GPT2 resume training from saved checkpoint


Resuming GPT-2 fine-tuning, implemented with run_clm.py

Does Hugging Face's GPT-2 have a parameter to resume training from a saved checkpoint, instead of training again from the beginning? Suppose the Python notebook crashes while training: the checkpoints are saved, but when I train the model again it still starts from the beginning.


Fine-tuning code:

!python3 run_clm.py \
    --train_file source.txt \
    --do_train \
    --output_dir gpt-finetuned \
    --overwrite_output_dir \
    --per_device_train_batch_size 2 \
    --model_name_or_path=gpt2 \
    --save_steps 100 \
    --num_train_epochs=1 \
    --block_size=200 \
    --tokenizer_name=gpt2

In the command above, run_clm.py is a script provided by Hugging Face (in the transformers examples) for fine-tuning causal language models such as GPT-2 on a custom dataset.
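
With --save_steps 100, the trainer periodically writes numbered checkpoint folders into --output_dir. A quick way to list what was saved (a sketch, assuming the default checkpoint-<step> naming that transformers uses):

import os
import re

# Checkpoint folders are named checkpoint-<global_step> by default.
output_dir = "gpt-finetuned"
checkpoints = sorted(
    (d for d in os.listdir(output_dir) if re.fullmatch(r"checkpoint-\d+", d)),
    key=lambda d: int(d.rsplit("-", 1)[1]),
)
print(checkpoints)  # e.g. ['checkpoint-100', 'checkpoint-200', ...]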


Solution

  • To resume training from a checkpoint, use the --model_name_or_path parameter: instead of the default gpt2, point it at your latest checkpoint folder.

    So your command becomes:

    !python3 run_clm.py \
        --train_file source.txt \
        --do_train \
        --output_dir gpt-finetuned \
        --overwrite_output_dir \
        --per_device_train_batch_size 2 \
        --model_name_or_path=/content/models/checkpoint-5000 \
        --save_steps 100 \
        --num_train_epochs=1 \
        --block_size=200 \
        --tokenizer_name=gpt2
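
  • Note that --model_name_or_path only reloads the model weights. The Trainer can additionally restore the optimizer, learning-rate scheduler, and global step from the checkpoint folder via its resume_from_checkpoint argument. Below is a minimal sketch of that path using the Trainer API directly rather than run_clm.py; the dataset preprocessing is simplified and illustrative, not the script's exact logic:

    from datasets import load_dataset
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        DataCollatorForLanguageModeling,
        Trainer,
        TrainingArguments,
    )
    from transformers.trainer_utils import get_last_checkpoint

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    # Tokenize the training text file (simplified stand-in for run_clm.py's
    # grouping into fixed-size blocks).
    dataset = load_dataset("text", data_files={"train": "source.txt"})
    tokenized = dataset["train"].map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=200),
        batched=True,
        remove_columns=["text"],
    )

    args = TrainingArguments(
        output_dir="gpt-finetuned",
        per_device_train_batch_size=2,
        num_train_epochs=1,
        save_steps=100,
    )

    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=tokenized,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )

    # get_last_checkpoint returns the newest checkpoint-* folder in
    # output_dir, or None if there is none. If a checkpoint is found,
    # trainer.train restores the weights, optimizer/scheduler state, and
    # global step from it; with None it simply trains from scratch.
    last_checkpoint = get_last_checkpoint(args.output_dir)
    trainer.train(resume_from_checkpoint=last_checkpoint)

    Depending on your transformers version, the example scripts perform this detection themselves and auto-resume from the last checkpoint found in --output_dir when --overwrite_output_dir is not passed.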