Tags: python, huggingface-transformers

Enhance a MarianMT pretrained model from HuggingFace with more training data


I am using a pretrained MarianMT machine translation model from English to German. I also have a large set of high-quality English-to-German sentence pairs that I would like to use to improve the model's performance. The model is trained on the OPUS corpus, and I do not want it to forget that training data in the process. Is there a way to do that? Thanks.
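
For reference, here is roughly how I am loading and using the model (a minimal sketch, assuming the Helsinki-NLP/opus-mt-en-de checkpoint):

    from transformers import MarianMTModel, MarianTokenizer

    model_name = "Helsinki-NLP/opus-mt-en-de"  # OPUS-trained en->de checkpoint
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    # Translate a sample sentence to sanity-check the pretrained model.
    batch = tokenizer(["I have a large set of sentence pairs."],
                      return_tensors="pt", padding=True)
    generated = model.generate(**batch)
    print(tokenizer.batch_decode(generated, skip_special_tokens=True))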


Solution

  • Have you tried the finetune.sh script shown here? In addition to the short list of CLI flags listed there, you could try adding:

    --src_lang "en" \
    --tgt_lang "de" \
    --num_train_epochs 400 \
    --warmup_steps 20 \
    --train_batch_size 32 \
    --eval_batch_size 32 \
    --data_dir "/data/dir" \
    --output_dir "/path/to/store/model/etc" \
    --cache_dir "/path/for/misc/files" \
    --max_source_length 128 \
    --max_target_length 128 \
    --val_max_target_length 128 \
    --test_max_target_length 128 \
    --model_name_or_path "/path/to/pretrained"
    

    where "/path/to/pretrained" can be either a local path on your machine or the name of a pretrained MarianMT model (e.g. Helsinki-NLP/opus-mt-en-de or an equivalent OPUS model). The "/data/dir" directory holds a "train.source" and a "train.target" file for the source and target languages, such that line number x of the target file is the translation of line x of the source file (likewise for "val.source" and "val.target"); a sketch of preparing these files is shown further below. I changed the finetune.py script here to

    parser = TranslationModule.add_model_specific_args(parser, os.getcwd())
    
    

    and then ran the finetune.sh script.
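
    For concreteness, here is a minimal sketch of how such a data directory could be prepared from a list of sentence pairs (the pairs and the "/data/dir" path are placeholders):

    import os

    # Hypothetical in-memory corpus of (English, German) pairs.
    pairs = [
        ("Good morning.", "Guten Morgen."),
        ("How are you?", "Wie geht es dir?"),
    ]

    data_dir = "/data/dir"
    os.makedirs(data_dir, exist_ok=True)

    def write_split(split, sentence_pairs):
        # Line x of <split>.target must be the translation of line x of <split>.source.
        with open(os.path.join(data_dir, split + ".source"), "w", encoding="utf-8") as src, \
             open(os.path.join(data_dir, split + ".target"), "w", encoding="utf-8") as tgt:
            for en, de in sentence_pairs:
                src.write(en.strip() + "\n")
                tgt.write(de.strip() + "\n")

    write_split("train", pairs)
    write_split("val", pairs)  # use a real held-out split in practice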

    Note: The gradients blew up when I used the "fp16" flag (with PyTorch 1.6), so I removed it. You might also want to look at the "val_check_interval" and "check_val_every_n_epoch" flags, and check this issue on how to save multiple checkpoints.
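
    Since finetune.py is built on PyTorch Lightning, one way to keep more than one checkpoint is to construct your own ModelCheckpoint callback and wire it into the trainer setup. A minimal sketch (argument names vary across pytorch-lightning versions, and "val_loss" is an assumption about what the script logs, so adjust "monitor" accordingly):

    from pytorch_lightning.callbacks import ModelCheckpoint

    # Keep the 3 best checkpoints instead of only the most recent one.
    checkpoint_callback = ModelCheckpoint(
        dirpath="/path/to/store/model/etc",  # same as --output_dir above
        filename="{epoch}-{val_loss:.4f}",
        monitor="val_loss",                  # assumption: the script logs "val_loss"
        mode="min",
        save_top_k=3,
    )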