Tags: python, huggingface-transformers

Enhance a MarianMT pretrained model from HuggingFace with more training data


I am using a pretrained MarianMT machine translation model from English to German. I also have a large set of high-quality English-to-German sentence pairs that I would like to use to improve the model's performance. The model is trained on the OPUS corpus, and I do not want it to forget that training data in the process. Is there a way to do that? Thanks.
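
For reference, here is roughly how I am loading and using the model (a minimal sketch, assuming the Helsinki-NLP/opus-mt-en-de checkpoint):

    from transformers import MarianMTModel, MarianTokenizer

    model_name = "Helsinki-NLP/opus-mt-en-de"  # OPUS-trained en->de checkpoint
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    # Translate a sample sentence to sanity-check the pretrained model.
    batch = tokenizer(["I have a large set of sentence pairs."],
                      return_tensors="pt", padding=True)
    generated = model.generate(**batch)
    print(tokenizer.batch_decode(generated, skip_special_tokens=True))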


Solution

  • Have you tried the finetune.sh script shown here? In addition to the short list of CLI flags listed there, you could try adding:

    --src_lang "en" \
    --tgt_lang "de" \
    --num_train_epochs 400 \
    --warmup_steps 20 \
    --train_batch_size 32 \
    --eval_batch_size 32 \
    --data_dir "/data/dir" \
    --output_dir "/path/to/store/model/etc" \
    --cache_dir "/path/for/misc/files" \
    --max_source_length 128 \
    --max_target_length 128 \
    --val_max_target_length 128 \
    --test_max_target_length 128 \
    --model_name_or_path "/path/to/pretrained"
    

    where "/path/to/pretrained" can be either a local path on your machine or the name of a pretrained MarianMT model (e.g. Helsinki-NLP/opus-mt-en-de or an equivalent OPUS model). The "/data/dir" directory holds a "train.source" and a "train.target" file for the source and target languages, such that line number x of the target file is the translation of line x of the source file (likewise for "val.source" and "val.target"); a sketch of preparing these files is shown further below. I changed the finetune.py script here to

    parser = TranslationModule.add_model_specific_args(parser, os.getcwd())
    
    

    and then ran the finetune.sh script.
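
    For concreteness, here is a minimal sketch of how such a data directory could be prepared from a list of sentence pairs (the pairs and the "/data/dir" path are placeholders):

    import os

    # Hypothetical in-memory corpus of (English, German) pairs.
    pairs = [
        ("Good morning.", "Guten Morgen."),
        ("How are you?", "Wie geht es dir?"),
    ]

    data_dir = "/data/dir"
    os.makedirs(data_dir, exist_ok=True)

    def write_split(split, sentence_pairs):
        # Line x of <split>.target must be the translation of line x of <split>.source.
        with open(os.path.join(data_dir, split + ".source"), "w", encoding="utf-8") as src, \
             open(os.path.join(data_dir, split + ".target"), "w", encoding="utf-8") as tgt:
            for en, de in sentence_pairs:
                src.write(en.strip() + "\n")
                tgt.write(de.strip() + "\n")

    write_split("train", pairs)
    write_split("val", pairs)  # use a real held-out split in practice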

    Note: The gradients blew up when I used the "fp16" flag (with PyTorch 1.6), so I removed it. You might also want to look at the "val_check_interval" and "check_val_every_n_epoch" flags, and check this issue on how to save multiple checkpoints.
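
    Since finetune.py is built on PyTorch Lightning, one way to keep more than one checkpoint is to construct your own ModelCheckpoint callback and wire it into the trainer setup. A minimal sketch (argument names vary across pytorch-lightning versions, and "val_loss" is an assumption about what the script logs, so adjust "monitor" accordingly):

    from pytorch_lightning.callbacks import ModelCheckpoint

    # Keep the 3 best checkpoints instead of only the most recent one.
    checkpoint_callback = ModelCheckpoint(
        dirpath="/path/to/store/model/etc",  # same as --output_dir above
        filename="{epoch}-{val_loss:.4f}",
        monitor="val_loss",                  # assumption: the script logs "val_loss"
        mode="min",
        save_top_k=3,
    )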