Tags: python, nlp, machine-translation, fine-tuning, fairseq

NLLB Fine-Tuning Error: Missing data_prefix Configuration (English-German Translation)


I'm attempting to fine-tune the NLLB model "facebook/nllb-200-distilled-600M" for a scientific translation task from English (eng_Latn) to German (deu_Latn). I followed the official fine-tuning guidelines published by the NLLB authors.

Documentation: link

This is the code block that raises the error:

DATA_CONFIG = "/content/sample_data/data_config.json"
OUTPUT_DIR = "/content/outputs"
MODEL_FOLDER = "/content/drive/MyDrive/Thesis/nllb-checkpoints"
DROP = 0.1
SRC = "eng_Latn"
TGT = "deu_Latn"
!python /content/fairseq/examples/nllb/modeling/train/train_script.py \
    cfg=nllb200_dense3.3B_finetune_on_fbseed \
    cfg/dataset=default \
    cfg.dataset.lang_pairs="$SRC-$TGT" \
    cfg.fairseq_root=$(pwd) \
    cfg.output_dir=$OUTPUT_DIR \
    cfg.dropout=$DROP \
    cfg.warmup=10 \
    cfg.finetune_from_model=$MODEL_FOLDER/checkpoint.pt

This is the error:

/content/fairseq/examples/nllb/modeling/train/train_script.py:287: UserWarning: 
The version_base parameter is not specified.
Please specify a compatability version level, or None.
Will assume defaults for version 1.1
  @hydra.main(config_path="conf", config_name="base_config")
/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
TRAINING DIR:  /content/outputs
Error executing job with overrides: ['cfg=nllb200_dense3.3B_finetune_on_fbseed', 'cfg/dataset=default', 'cfg.dataset.lang_pairs=eng_Latn-deu_Latn', 'cfg.fairseq_root=/content', 'cfg.output_dir=/content/outputs', 'cfg.dropout=0.1', 'cfg.warmup=10', 'cfg.finetune_from_model=/content/drive/MyDrive/LASS_KG_Data/Thesis/nllb-checkpoints/checkpoint.pt']
Traceback (most recent call last):
  File "/content/fairseq/examples/nllb/modeling/train/train_script.py", line 289, in main
    train_module = TrainModule(config)
  File "/content/fairseq/examples/nllb/modeling/train/train_script.py", line 122, in __init__
    assert cluster_name in cfg.dataset.data_prefix
omegaconf.errors.ConfigAttributeError: Key 'data_prefix' is not in struct
    full_key: cfg.dataset.data_prefix
    object_type=dict

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

So far, I understand that the data_prefix configuration is missing. I created a custom demo data_config.json, which looks like this:

{
    "data_prefix": "/content/sample_data",
    "train_data": "train_demo.json",
    "test_data": "test_demo.json",
    "lang_pairs": "eng_Latn-deu_Latn"
}

While the official documentation provides some information, I'm encountering difficulties in applying it to my specific use case. Can someone share a detailed guide or point me to helpful resources on fine-tuning NLLB?


Solution

  • While I can't help you with the concrete error message you are getting (my guess would be an issue with the structure of the provided JSON files), my personal recommendation would be to fine-tune NLLB with the transformers library, specifically using the Seq2SeqTrainer.

    I have done this before for multiple models, including NLLB. Check out this repository: https://github.com/EliasK93/transformer-models-for-domain-specific-machine-translation/

    This way, the fine-tuning and inference process for the NLLB model is the same as for any bilingual model (you can find guides for those more easily), with the only exception that you load the tokenizer like so:

    from transformers import NllbTokenizer
    tokenizer = NllbTokenizer.from_pretrained(model_path, src_lang="eng_Latn", tgt_lang="deu_Latn")
    

    and generate translations like this:

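    # Look up the target-language token ID by name rather than by its position in an encoded sequence
    model.generate(tokenized_chunk.input_ids, forced_bos_token_id=tokenizer.convert_tokens_to_ids("deu_Latn"), max_length=512)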
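
To make the recommendation above more concrete, here is a minimal sketch of how the pieces could fit together with Seq2SeqTrainer. It is not the code from the linked repository. The column names "src" and "tgt", the two placeholder sentence pairs, the output directory, and all hyperparameters are assumptions chosen for illustration, and the sketch also assumes the datasets library is installed next to transformers. The target-language token ID is looked up with convert_tokens_to_ids so that it does not depend on where the language code sits in an encoded sequence.

```python
MODEL_NAME = "facebook/nllb-200-distilled-600M"
SRC_LANG, TGT_LANG = "eng_Latn", "deu_Latn"


def make_preprocess_fn(tokenizer, max_length=128):
    """Return a function that tokenizes one batch of parallel sentences.

    Assumes the batch is a dict with the (hypothetical) columns "src" and "tgt".
    """
    def preprocess(batch):
        # text_target makes the tokenizer return the target token IDs as "labels".
        return tokenizer(
            batch["src"],
            text_target=batch["tgt"],
            max_length=max_length,
            truncation=True,
        )
    return preprocess


def translate(model, tokenizer, sentences, max_length=512):
    """Translate a list of source sentences into TGT_LANG."""
    inputs = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
    inputs = inputs.to(model.device)
    output_ids = model.generate(
        **inputs,
        # Force the decoder to start with the target-language code.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(TGT_LANG),
        max_length=max_length,
    )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)


def main():
    # Heavy imports are local so the helpers above can be imported without them.
    from datasets import Dataset
    from transformers import (
        AutoModelForSeq2SeqLM,
        DataCollatorForSeq2Seq,
        NllbTokenizer,
        Seq2SeqTrainer,
        Seq2SeqTrainingArguments,
    )

    tokenizer = NllbTokenizer.from_pretrained(
        MODEL_NAME, src_lang=SRC_LANG, tgt_lang=TGT_LANG
    )
    model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

    # Placeholder data: replace with your own parallel corpus.
    pairs = {
        "src": ["The cell divides.", "The sample was heated."],
        "tgt": ["Die Zelle teilt sich.", "Die Probe wurde erhitzt."],
    }
    train_ds = Dataset.from_dict(pairs).map(
        make_preprocess_fn(tokenizer), batched=True, remove_columns=["src", "tgt"]
    )

    # Illustrative hyperparameters only, not tuned values.
    args = Seq2SeqTrainingArguments(
        output_dir="nllb-finetuned",
        per_device_train_batch_size=4,
        learning_rate=2e-5,
        num_train_epochs=1,
    )
    trainer = Seq2SeqTrainer(
        model=model,
        args=args,
        train_dataset=train_ds,
        # Pads inputs and labels per batch; label padding is ignored by the loss.
        data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    )
    trainer.train()
    print(translate(model, tokenizer, ["The cell divides."]))

# Call main() to start training (the model is downloaded on first use).
```

Keeping the preprocessing in its own function makes it easy to check the tokenization on a handful of sentence pairs before starting a full training run.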