I am trying to do machine translation from Hindi to Sanskrit using the NLLB model, but I keep getting the following error:
IndexError: Invalid key: 39463 is out of bounds for size 0.
Detailed error message:
Traceback (most recent call last):
  File "nllbtrain.py", line 273, in <module>
    print(trainer.train())
  File "/home//.conda/envs/dict/lib/python3.8/site-packages/transformers/trainer.py", line 1645, in train
    return inner_training_loop(
  File "/home//.conda/envs/dict/lib/python3.8/site-packages/transformers/trainer.py", line 1907, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "/home//.conda/envs/dict/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
  File "/home//.conda/envs/dict/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 675, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home//.conda/envs/dict/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = self.dataset.__getitems__(possibly_batched_index)
  File "/home//.conda/envs/dict/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2814, in __getitems__
    batch = self.__getitem__(keys)
  File "/home//.conda/envs/dict/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2810, in __getitem__
    return self._getitem(key)
  File "/home//.conda/envs/dict/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2794, in _getitem
    pa_subtable = query_table(self._data, key, indices=self._indices)
  File "/home//.conda/envs/dict/lib/python3.8/site-packages/datasets/formatting/formatting.py", line 583, in query_table
    _check_valid_index_key(key, size)
  File "/home//.conda/envs/dict/lib/python3.8/site-packages/datasets/formatting/formatting.py", line 536, in _check_valid_index_key
    _check_valid_index_key(int(max(key)), size=size)
  File "/home//.conda/envs/dict/lib/python3.8/site-packages/datasets/formatting/formatting.py", line 526, in _check_valid_index_key
    raise IndexError(f"Invalid key: {key} is out of bounds for size {size}")
IndexError: Invalid key: 39463 is out of bounds for size 0
The preprocessing code for the data:
def preprocess_function(examples):
    inputs = [example + ' </s>' + f' <2{s_lang}>' for example in examples[source_lang]]
    targets = [f'<2{t_lang}> ' + example + ' </s>' for example in examples[target_lang]]
    model_inputs = tokenizer.batch_encode_plus(inputs, max_length=max_input_length, truncation=True, padding='max_length')
    # model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)
    with tokenizer.as_target_tokenizer():
        # labels = tokenizer(targets, max_length=max_target_length, truncation=True)
        labels = tokenizer.batch_encode_plus(targets, max_length=max_input_length, truncation=True, padding='max_length')
    model_inputs['labels'] = labels['input_ids']
    return model_inputs
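For context, the function is applied with Dataset.map before training. The snippet below is only an illustrative sketch of that step; the load_dataset call and file names are placeholders, not the exact code from my script:

from datasets import load_dataset

# Placeholder: load the Hindi-Sanskrit parallel data (actual loading differs in my script)
raw_datasets = load_dataset('csv', data_files={'train': 'train.csv', 'val': 'val.csv', 'test': 'test.csv'})

# Apply the preprocessing in batches; batched=True because preprocess_function
# expects lists of examples per column.
dataset = raw_datasets.map(preprocess_function, batched=True)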
Data after preprocessing:
DatasetDict({
    train: Dataset({
        features: ['Hindi', 'Sanskrit', '__index_level_0__', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 39729
    })
    val: Dataset({
        features: ['Hindi', 'Sanskrit', '__index_level_0__', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 2210
    })
    test: Dataset({
        features: ['Hindi', 'Sanskrit', '__index_level_0__', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 2214
    })
})
The model parameters and training code:
model_path = 'facebook/nllb-200-1.3B'
model = AutoModelForSeq2SeqLM.from_pretrained(pretrained_model_name_or_path=model_path)
tokenizer = AutoTokenizer.from_pretrained('facebook/nllb-200-1.3B', do_lower_case=False, use_fast=False, truncation=True, keep_accents=True, src_lang="hin_Deva", tgt_lang="san_Deva", max_length=500)
training_args = Seq2SeqTrainingArguments(
    evaluation_strategy="epoch",
    save_strategy='epoch',
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    output_dir="./output_dir",
    weight_decay=0.01,
    save_total_limit=1,
    num_train_epochs=4,
    predict_with_generate=True,
    fp16=False,
    push_to_hub=False,
)
trainer = Seq2SeqTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=dataset['train'],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
print(trainer.train())
Any idea why this error persists?
The "size 0" in the error indicates that the dataset your trainer receives when fine-tuning starts is empty. By default the Trainer drops every dataset column whose name does not match an argument of the model's forward() method, and if that pruning removes everything, the training dataset ends up empty. This forum thread (https://discuss.huggingface.co/t/indexerror-invalid-key-16-is-out-of-bounds-for-size-0/14298/25) and this GitHub issue (https://github.com/huggingface/datasets/issues/6535) suggest adding remove_unused_columns=False to your training_args, which might resolve the issue, so you could give that a try.
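As a sketch of that change against the arguments you already posted, it is just one extra flag (everything else unchanged):

training_args = Seq2SeqTrainingArguments(
    evaluation_strategy="epoch",
    save_strategy='epoch',
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    output_dir="./output_dir",
    weight_decay=0.01,
    save_total_limit=1,
    num_train_epochs=4,
    predict_with_generate=True,
    fp16=False,
    push_to_hub=False,
    remove_unused_columns=False,  # keep all dataset columns instead of dropping the ones forward() doesn't accept
)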