I am trying to do machine translation from Hindi to Sanskrit using the NLLB model, but I keep getting the following error:
IndexError: Invalid key: 39463 is out of bounds for size 0.
Detailed error message:
Traceback (most recent call last):
  File "nllbtrain.py", line 273, in <module>
    print(trainer.train())
  File "/home//.conda/envs/dict/lib/python3.8/site-packages/transformers/trainer.py", line 1645, in train
    return inner_training_loop(
  File "/home//.conda/envs/dict/lib/python3.8/site-packages/transformers/trainer.py", line 1907, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "/home//.conda/envs/dict/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
  File "/home//.conda/envs/dict/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 675, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home//.conda/envs/dict/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = self.dataset.__getitems__(possibly_batched_index)
  File "/home//.conda/envs/dict/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2814, in __getitems__
    batch = self.__getitem__(keys)
  File "/home//.conda/envs/dict/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2810, in __getitem__
    return self._getitem(key)
  File "/home//.conda/envs/dict/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2794, in _getitem
    pa_subtable = query_table(self._data, key, indices=self._indices)
  File "/home//.conda/envs/dict/lib/python3.8/site-packages/datasets/formatting/formatting.py", line 583, in query_table
    _check_valid_index_key(key, size)
  File "/home//.conda/envs/dict/lib/python3.8/site-packages/datasets/formatting/formatting.py", line 536, in _check_valid_index_key
    _check_valid_index_key(int(max(key)), size=size)
  File "/home//.conda/envs/dict/lib/python3.8/site-packages/datasets/formatting/formatting.py", line 526, in _check_valid_index_key
    raise IndexError(f"Invalid key: {key} is out of bounds for size {size}")
IndexError: Invalid key: 39463 is out of bounds for size 0
The preprocessing code for the data:
def preprocess_function(examples):
    inputs = [example + ' </s>' + f' <2{s_lang}>' for example in examples[source_lang]]
    targets = [f'<2{t_lang}> ' + example + ' </s>' for example in examples[target_lang]]
    model_inputs = tokenizer.batch_encode_plus(inputs, max_length=max_input_length, truncation=True, padding='max_length')
    # model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)
    with tokenizer.as_target_tokenizer():
        # labels = tokenizer(targets, max_length=max_target_length, truncation=True)
        labels = tokenizer.batch_encode_plus(targets, max_length=max_input_length, truncation=True, padding='max_length')
    model_inputs['labels'] = labels['input_ids']
    return model_inputs
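For context, the function is applied with Dataset.map before training. The snippet below is only an illustrative sketch of that step; the load_dataset call and file names are placeholders, not the exact code from my script:

from datasets import load_dataset

# Placeholder: load the Hindi-Sanskrit parallel data (actual loading differs in my script)
raw_datasets = load_dataset('csv', data_files={'train': 'train.csv', 'val': 'val.csv', 'test': 'test.csv'})

# Apply the preprocessing in batches; batched=True because preprocess_function
# expects lists of examples per column.
dataset = raw_datasets.map(preprocess_function, batched=True)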
Data after preprocessing:
DatasetDict({
    train: Dataset({
        features: ['Hindi', 'Sanskrit', '__index_level_0__', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 39729
    })
    val: Dataset({
        features: ['Hindi', 'Sanskrit', '__index_level_0__', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 2210
    })
    test: Dataset({
        features: ['Hindi', 'Sanskrit', '__index_level_0__', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 2214
    })
})
The model parameters and training code:
model_path = 'facebook/nllb-200-1.3B'
model = AutoModelForSeq2SeqLM.from_pretrained(pretrained_model_name_or_path=model_path)
tokenizer = AutoTokenizer.from_pretrained('facebook/nllb-200-1.3B', do_lower_case=False, use_fast=False, truncation=True, keep_accents=True, src_lang="hin_Deva", tgt_lang="san_Deva", max_length=500)
training_args = Seq2SeqTrainingArguments(
    evaluation_strategy="epoch",
    save_strategy='epoch',
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    output_dir="./output_dir",
    weight_decay=0.01,
    save_total_limit=1,
    num_train_epochs=4,
    predict_with_generate=True,
    fp16=False,
    push_to_hub=False,
)
trainer = Seq2SeqTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=dataset['train'],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
print(trainer.train())
Any idea why this error persists?
The "size 0" in the error indicates that the dataset your trainer receives when fine-tuning starts is empty. By default the Trainer drops every dataset column whose name does not match an argument of the model's forward() method, and if that pruning removes everything, the training dataset ends up empty. This forum thread (https://discuss.huggingface.co/t/indexerror-invalid-key-16-is-out-of-bounds-for-size-0/14298/25) and this GitHub issue (https://github.com/huggingface/datasets/issues/6535) suggest adding remove_unused_columns=False to your training_args, which might resolve the issue, so you could give that a try.
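As a sketch of that change against the arguments you already posted, it is just one extra flag (everything else unchanged):

training_args = Seq2SeqTrainingArguments(
    evaluation_strategy="epoch",
    save_strategy='epoch',
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    output_dir="./output_dir",
    weight_decay=0.01,
    save_total_limit=1,
    num_train_epochs=4,
    predict_with_generate=True,
    fp16=False,
    push_to_hub=False,
    remove_unused_columns=False,  # keep all dataset columns instead of dropping the ones forward() doesn't accept
)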