Tags: python, huggingface-transformers, pre-trained-model, machine-translation, fine-tuning

How can I fine-tune mBART-50 for machine translation in the transformers Python library so that it learns a new word?


I am trying to fine-tune mBART-50 (paper, pre-trained model on Hugging Face) for machine translation with the transformers Python library. To test the fine-tuning, I am simply trying to teach mBART-50 a new word that I made up.

I use the following code. Over 95% of the code is from the Hugging Face documentation:

from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

print('Model loading started')
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50", src_lang="fr_XX", tgt_lang="en_XX")
print('Model loading done')

src_text = " billozarion "
tgt_text =  " plorization "

model_inputs = tokenizer(src_text, return_tensors="pt")
with tokenizer.as_target_tokenizer():
    labels = tokenizer(tgt_text, return_tensors="pt").input_ids

print('Fine-tuning started')
for i in range(1000):
    #pass
    model(**model_inputs, labels=labels) # forward pass
print('Fine-tuning ended')
    
# Testing whether the model learned the new word. Translate French to English
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
tokenizer.src_lang = "fr_XX"
article_fr = src_text
encoded_fr = tokenizer(article_fr, return_tensors="pt")
generated_tokens = model.generate(**encoded_fr, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"])
translation = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
print(translation)

However, the new word wasn't learned. The output is "billozarion" instead of "plorization". Why?

I'm strictly following the Hugging Face documentation, unless I missed something. The # forward pass comment does concern me, as one would also need a backward pass to compute the gradients and an optimizer step to update the weights. Maybe this means the documentation is incomplete, but I can't test that hypothesis, as I don't know how to add the backward pass.


Environment used to run the code: Ubuntu 20.04.5 LTS with an NVIDIA A100 40GB GPU (I also tested with an NVIDIA T4 Tensor Core GPU) and CUDA 12.0, using the following conda environment:

conda create --name mbart-python39 python=3.9
conda activate mbart-python39 
pip install transformers==4.28.1
pip install chardet==5.1.0
pip install sentencepiece==0.1.99
pip install protobuf==3.20

Solution

  • One can fix this by adding a backward pass and an optimizer step to fine-tune mBART-50:

    from transformers.optimization import AdamW
    
    # Set up the optimizer and training settings
    optimizer = AdamW(model.parameters(), lr=1e-4)
    model.train()
    
    print('Fine-tuning started')
    for i in range(100):
        optimizer.zero_grad()                          # reset gradients from the previous iteration
        output = model(**model_inputs, labels=labels)  # forward pass
        loss = output.loss
        loss.backward()                                # backward pass: compute gradients
        optimizer.step()                               # update the model weights
    print('Fine-tuning ended')
    

    Full code:

    from transformers import MBartForConditionalGeneration, MBart50TokenizerFast
    from transformers.optimization import AdamW
    import os
    os.environ["TOKENIZERS_PARALLELISM"] = "false"
    
    
    print('Model loading started')
    model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")
    tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50", src_lang="fr_XX", tgt_lang="en_XX")
    print('Model loading done')
    
    src_text = " billozarion "
    tgt_text =  " plorizatizzzon "
    
    model_inputs = tokenizer(src_text, return_tensors="pt")
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(tgt_text, return_tensors="pt").input_ids
    
    # Set up the optimizer and training settings
    optimizer = AdamW(model.parameters(), lr=1e-4)
    model.train()
    
    print('Fine-tuning started')
    for i in range(100):
        optimizer.zero_grad()
        output = model(**model_inputs, labels=labels) # forward pass
        loss = output.loss
        loss.backward()
        optimizer.step()
    print('Fine-tuning ended')
        
    # translate French to English
    tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
    tokenizer.src_lang = "fr_XX"
    article_fr = src_text
    encoded_fr = tokenizer(article_fr, return_tensors="pt")
    generated_tokens = model.generate(**encoded_fr, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"])
    translation = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
    print(translation)
    

    It outputs the correct made-up translation "plorizatizzzon".
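
    To keep the fine-tuned weights for later use, one could save and reload the model with save_pretrained/from_pretrained (a minimal sketch; the directory name "mbart-finetuned" is an arbitrary choice):

    # Save the fine-tuned model and tokenizer to disk
    model.save_pretrained("mbart-finetuned")
    tokenizer.save_pretrained("mbart-finetuned")

    # Reload them later for inference
    model = MBartForConditionalGeneration.from_pretrained("mbart-finetuned")
    tokenizer = MBart50TokenizerFast.from_pretrained("mbart-finetuned")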

    I reported the documentation issue at https://github.com/huggingface/transformers/issues/23185


    https://github.com/huggingface/transformers/tree/main/examples/pytorch/translation contains two more advanced scripts to fine-tune mBART and T5 (thanks to sgugger for pointing me to them). Here is how to use the run_translation.py script to fine-tune mBART:

    Create a new conda environment:

    conda create --name mbart-source-transformers-python39 python=3.9
    conda activate mbart-source-transformers-python39 
    git clone https://github.com/huggingface/transformers.git
    cd transformers
    pip install git+https://github.com/huggingface/transformers
    pip install datasets evaluate accelerate sacrebleu
    conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
    pip install sentencepiece==0.1.99
    pip install protobuf==3.20
    pip install --force-reinstall charset-normalizer==3.1.0
    

    Command:

    python examples/pytorch/translation/run_translation.py \
        --model_name_or_path facebook/mbart-large-50 \
        --do_train \
        --do_eval \
        --source_lang fr_XX \
        --target_lang en_XX \
        --source_prefix "translate French to English: " \
        --train_file finetuning-translation-train.json \
        --validation_file finetuning-translation-validation.json  \
        --test_file finetuning-translation-test.json \
        --output_dir tmp/tst-translation4 \
        --per_device_train_batch_size=4 \
        --per_device_eval_batch_size=4 \
        --overwrite_output_dir \
        --do_predict \
        --predict_with_generate
    

    (Note: the readme seems to omit --do_predict.)

    with finetuning-translation-train.json, finetuning-translation-validation.json, and finetuning-translation-test.json formatted as follows, in the JSON Lines format:

    {"translation": {"en": "20 year-old male tennis player.", "fr": "Joueur de tennis de 12 ans"}}
    {"translation": {"en": "2 soldiers in an old military Jeep", "fr": "2 soldats dans une vielle Jeep militaire"}}
    

    (Note: one must use double quotes in the .json files; single quotes, e.g. 'en', will make the script crash.)
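
    One way to guarantee valid double quotes is to generate the .json files with Python's json module, which always emits them (a minimal sketch reusing the example pairs above; it writes only the training file):

    import json

    # Each line is one JSON object with a "translation" dict (JSON Lines format)
    pairs = [
        {"en": "20 year-old male tennis player.", "fr": "Joueur de tennis de 12 ans"},
        {"en": "2 soldiers in an old military Jeep", "fr": "2 soldats dans une vielle Jeep militaire"},
    ]

    with open("finetuning-translation-train.json", "w", encoding="utf-8") as f:
        for pair in pairs:
            # json.dumps always produces double-quoted keys and strings
            f.write(json.dumps({"translation": pair}, ensure_ascii=False) + "\n")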

    I ran the code on Ubuntu 20.04.5 LTS with an NVIDIA T4 Tensor Core GPU (16GB memory) and CUDA 12.0. The mBART-50 model takes around 15GB of GPU memory.
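
    To verify the memory figure from Python, one could read PyTorch's peak-allocation counter after training (a sketch; this counts tensor allocations only, so nvidia-smi may report somewhat more):

    import torch

    # Peak GPU memory allocated for tensors since program start, in GB
    print(f"Peak GPU memory: {torch.cuda.max_memory_allocated() / 1024**3:.1f} GB")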