I am trying to use a data_collator function in Hugging Face with this code:
import torch

datasets = dataset.train_test_split(test_size=0.1)
train_dataset = datasets["train"]
val_dataset = datasets["test"]
print(type(train_dataset))

def data_collator(data):
    # Initialize lists to store pixel values and input ids
    pixel_values_list = []
    input_ids_list = []
    # Iterate over each sample in the batch
    for item in data:
        pixel_values_list.append(torch.tensor(item["pixel_values"]))
        input_ids_list.append(torch.tensor(item["input_ids"]))
    return {
        "pixel_values": torch.stack(pixel_values_list),
        "labels": torch.stack(input_ids_list)
    }
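For what it's worth, calling the collator by hand on raw samples works fine (a minimal debugging sketch, assuming train_dataset is the split defined above):

print(train_dataset[0].keys())  # all 5 keys are present here
batch = data_collator([train_dataset[0], train_dataset[1]])
print(batch["pixel_values"].shape, batch["labels"].shape)

So the keys are only being lost once the Trainer takes over.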
The train_dataset has 5 keys, including input_ids. However, when I print(data[0]) inside the data_collator function, I only see one key, which causes an error when running the trainer:
Traceback (most recent call last):
  File "caption-code.py", line 134, in <module>
    trainer.train()
  File "C:\Users\moham\anaconda3\envs\transformer\lib\site-packages\transformers\trainer.py", line 1321, in train
    ignore_keys_for_eval=ignore_keys_for_eval,
  File "C:\Users\moham\anaconda3\envs\transformer\lib\site-packages\transformers\trainer.py", line 1528, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "C:\Users\moham\anaconda3\envs\transformer\lib\site-packages\torch\utils\data\dataloader.py", line 521, in __next__
    data = self._next_data()
  File "C:\Users\moham\anaconda3\envs\transformer\lib\site-packages\torch\utils\data\dataloader.py", line 561, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "C:\Users\moham\anaconda3\envs\transformer\lib\site-packages\torch\utils\data\_utils\fetch.py", line 52, in fetch
    return self.collate_fn(data)
  File "caption-code.py", line 102, in data_collator
    input_ids_list.append(item["input_ids"])
KeyError: 'input_ids'
I am setting up the trainer as follows:
training_args = Seq2SeqTrainingArguments(
    predict_with_generate=True,
    evaluation_strategy="epoch",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    output_dir="C:/Users/moham/Desktop/Euler/output",
    logging_dir="./logs",
    logging_steps=10,
    save_steps=10,
    eval_steps=10,
    warmup_steps=10,
    max_steps=100,  # adjust as needed
    overwrite_output_dir=True,
    save_total_limit=3,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_exact_match
)

trainer.train()
The actual issue is in your Seq2SeqTrainingArguments, which is what leads to the error in your data_collator().

Reason: When you provide a custom data_collator(), the Trainer by default removes any columns that are not accepted by the model's forward() method. So even though each sample in your train_dataset has all the keys, by the time a sample reaches data_collator(), the Trainer has already stripped the unknown columns.
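You can see roughly which columns survive this filtering by inspecting the model's forward() signature yourself (a sketch of what the Trainer does internally, assuming model is your model object):

import inspect

# The Trainer keeps only the dataset columns whose names appear in the
# model's forward() signature (plus the label column names); every
# other column is dropped before the samples reach the collator.
signature_columns = list(inspect.signature(model.forward).parameters.keys())
print(signature_columns)

Any dataset key not in this list (in your case, apparently everything except one) is silently dropped.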
Solution: You need to add the following argument to your training arguments:

training_args = Seq2SeqTrainingArguments(
    predict_with_generate=True,
    remove_unused_columns=False,
    ...)
Setting remove_unused_columns=False prevents this default behaviour, and you will get the full samples in data_collator(). This issue is useful for further reference.
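As a side note, unrelated to the fix: if your preprocessing already stores tensors, torch.tensor() will copy them and emit a warning, whereas torch.as_tensor() handles both cases. A minimal sketch of the same collator with that tweak (an assumption about your preprocessing, not something your error requires):

import torch

def data_collator(data):
    # torch.as_tensor is a no-op for values that are already tensors
    # and converts lists/NumPy arrays otherwise.
    pixel_values = torch.stack([torch.as_tensor(item["pixel_values"]) for item in data])
    labels = torch.stack([torch.as_tensor(item["input_ids"]) for item in data])
    return {"pixel_values": pixel_values, "labels": labels}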