Tags: pytorch, huggingface-transformers, huggingface-tokenizers, huggingface-datasets

Setting `remove_unused_columns=False` causes error in HuggingFace Trainer class


I am training a model using the HuggingFace Trainer class. The following code does a decent job:

!pip install datasets
!pip install transformers

from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer, AutoTokenizer

dataset = load_dataset('glue', 'mnli')
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=3)
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', use_fast=True)

def preprocess_function(examples):
    # Tokenize premise/hypothesis pairs; padding=True pads each map batch to its longest sequence
    return tokenizer(examples["premise"], examples["hypothesis"], truncation=True, padding=True)
encoded_dataset = dataset.map(preprocess_function, batched=True)

args = TrainingArguments(
    "test-glue",
    learning_rate=3e-5,
    per_device_train_batch_size=8,
    num_train_epochs=3,
    remove_unused_columns=True
  )

trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    tokenizer=tokenizer
)
trainer.train()

However, setting `remove_unused_columns=False` results in the following error:

ValueError                                Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in convert_to_tensors(self, tensor_type, prepend_batch_axis)
    704                 if not is_tensor(value):
--> 705                     tensor = as_tensor(value)
    706 

ValueError: too many dimensions 'str'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
8 frames
/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in convert_to_tensors(self, tensor_type, prepend_batch_axis)
    720                     )
    721                 raise ValueError(
--> 722                     "Unable to create tensor, you should probably activate truncation and/or padding "
    723                     "with 'padding=True' 'truncation=True' to have batched tensors with the same length."
    724                 )

ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.

Any suggestions are highly appreciated.


Solution

  • It fails because the value at line 705 is a list of str coming from the raw hypothesis column. hypothesis is one of the ignored_columns in trainer.py: with remove_unused_columns=True it is dropped before batching, but with remove_unused_columns=False it reaches the default data collator, which cannot convert strings to tensors. A minimal reproduction is sketched after the traceback snippet below.

    /usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in convert_to_tensors(self, tensor_type, prepend_batch_axis)
        704                 if not is_tensor(value):
    --> 705                     tensor = as_tensor(value)
    

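    The underlying failure is easy to reproduce in isolation, since PyTorch cannot build a tensor out of Python strings. A minimal sketch (the batch dictionary below is made up for illustration):

    import torch

    # A batch that still contains a raw string column, as happens when
    # remove_unused_columns=False leaves 'hypothesis' in the dataset.
    batch = {
        "input_ids": [[101, 102], [101, 102]],
        "hypothesis": ["a string", "another string"],
    }

    torch.as_tensor(batch["input_ids"])   # works: numeric, equal-length lists
    torch.as_tensor(batch["hypothesis"])  # ValueError: too many dimensions 'str'
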
    See the snippet below from trainer.py for how the remove_unused_columns flag is handled:

    def _remove_unused_columns(self, dataset: "datasets.Dataset", description: Optional[str] = None):
        if not self.args.remove_unused_columns:
            return dataset
        if self._signature_columns is None:
            # Inspect model forward signature to keep only the arguments it accepts.
            signature = inspect.signature(self.model.forward)
            self._signature_columns = list(signature.parameters.keys())
            # Labels may be named label or label_ids, the default data collator handles that.
            self._signature_columns += ["label", "label_ids"]
        columns = [k for k in self._signature_columns if k in dataset.column_names]
        ignored_columns = list(set(dataset.column_names) - set(self._signature_columns))
    

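    You can check which columns survive by inspecting the model's forward signature yourself. A quick sketch (the exact parameter list depends on your transformers version):

    import inspect
    from transformers import AutoModelForSequenceClassification

    model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)
    print(list(inspect.signature(model.forward).parameters.keys()))
    # e.g. ['input_ids', 'attention_mask', 'token_type_ids', 'position_ids', 'head_mask',
    #       'inputs_embeds', 'labels', 'output_attentions', 'output_hidden_states', 'return_dict']
    # 'premise', 'hypothesis' and 'idx' are not in this list, so they become
    # ignored_columns and are dropped when remove_unused_columns=True.
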
    This might warrant a pull request to HuggingFace to provide a fallback for when the flag is False. In general, though, the flag's implementation looks incomplete: for example, it cannot be used with TensorFlow.

    In practice, it doesn't hurt to keep it set to True unless there is some special need for the extra columns; if there is, a workaround is sketched below.
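
    If you do need remove_unused_columns=False, one workaround (my sketch, not part of the original answer) is to drop the raw string columns yourself after tokenization, so the default data collator only sees tensor-convertible features. This assumes the MNLI column names premise, hypothesis and idx:

    # Drop the raw text columns manually so the collator never sees strings.
    encoded_dataset = encoded_dataset.remove_columns(["premise", "hypothesis", "idx"])

    trainer = Trainer(
        model,
        args,  # may now use remove_unused_columns=False
        train_dataset=encoded_dataset["train"],
        tokenizer=tokenizer,
    )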