I'm a little puzzled about where (and whether) EOS tokens are added when using Hugging Face's trainer classes to train a T5 (LongT5, actually) model.
The data set contains pairs of text like this:
| from | to |
| --- | --- |
| some text | some corresponding text |
| some other text | some other corresponding text |
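The pairs end up in a regular datasets DatasetDict with "train" and "test" splits. A simplified sketch of the loading (not the actual code; the column names match what preprocess_function below expects):
from datasets import Dataset

# Simplified stand-in for the real data loading.
pairs = {
    "from": ["some text", "some other text"],
    "to": ["some corresponding text", "some other corresponding text"],
}
dataset = Dataset.from_dict(pairs).train_test_split(test_size=0.5)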
The tokenizer has been custom trained:
from tokenizers import SentencePieceUnigramTokenizer

tokenizer = SentencePieceUnigramTokenizer()
tokenizer.train_from_iterator(iterator=iterator, vocab_size=32_128, show_progress=True, unk_token="<unk>")
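The trained tokenizer is then written to disk (presumably something like this; the path is the one loaded below):
# Produces the tokenizer.json file that T5TokenizerFast reads.
tokenizer.save("data-rb-25000/tokenizer.json")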
and is loaded like this:
from transformers import T5TokenizerFast

tokenizer = T5TokenizerFast(
    tokenizer_file="data-rb-25000/tokenizer.json",
    padding=True,
    bos_token="<s>",
    eos_token="</s>",
    unk_token="<unk>",
    pad_token="<pad>",
)
Before training, the data set is tokenized, and examples whose token count is too high are filtered out, like so:
MAX_SEQUENCE_LENGTH = 16_384 // 2
def preprocess_function(examples):
inputs = tokenizer(
examples['from'],
truncation=False, # Don't truncate yet
padding=False, # Don't pad yet
return_length=True,
)
labels = tokenizer(
examples['to'],
truncation=False,
padding=False,
return_length=True,
)
inputs["input_length"] = inputs["length"]
inputs["labels"] = labels["input_ids"]
inputs["label_length"] = labels["length"]
inputs.pop("length", None)
return inputs
tokenized_data = dataset.map(preprocess_function, batched=True, remove_columns=dataset["train"].column_names)
def filter_function(example):
return example['input_length'] <= MAX_SEQUENCE_LENGTH and example['label_length'] <= MAX_SEQUENCE_LENGTH
filtered_data = tokenized_data.filter(filter_function)
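A quick way to see what actually ends up in the processed dataset is to look at the last ids of a sample (sketch, using the column names produced by preprocess_function):
sample = filtered_data["train"][0]
print(sample["input_ids"][-1] == tokenizer.eos_token_id)  # is EOS the last input id?
print(sample["labels"][-1] == tokenizer.eos_token_id)     # is EOS the last label id?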
Training is done like this:
from transformers import DataCollatorForSeq2Seq
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model="google/long-t5-tglobal-base")
from transformers import AutoModelForSeq2SeqLM, AutoConfig
config = AutoConfig.from_pretrained(
"google/long-t5-tglobal-base",
vocab_size=len(tokenizer),
pad_token_id=tokenizer.pad_token_id,
eos_token_id=tokenizer.eos_token_id,
decoder_start_token_id=tokenizer.pad_token_id,
)
model = AutoModelForSeq2SeqLM.from_config(config)
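A small sanity check that the freshly initialized model picked up the custom vocabulary and special token ids (sketch):
print(model.config.vocab_size == len(tokenizer))
print(model.config.eos_token_id == tokenizer.eos_token_id)
print(model.config.decoder_start_token_id == tokenizer.pad_token_id)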
from transformers import GenerationConfig
generation_config = GenerationConfig.from_model_config(model.config)
generation_config._from_model_config = False
generation_config.max_new_tokens = 16_384
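The EOS and pad ids should have been copied from the model config into the generation config; generation can only stop early if eos_token_id is set there. A quick check:
# Both should print the ids set on the model config above, not None.
print(generation_config.eos_token_id, generation_config.pad_token_id)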
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments
training_args = Seq2SeqTrainingArguments(
output_dir="rb-25000-model",
eval_strategy="epoch",
save_strategy="epoch",
learning_rate=2e-5,
per_device_train_batch_size=1,
per_device_eval_batch_size=1,
gradient_accumulation_steps=16,
gradient_checkpointing=True,
weight_decay=0.01,
save_total_limit=3,
num_train_epochs=5,
logging_steps=1,
    predict_with_generate=True,
    generation_config=generation_config,  # Seq2SeqTrainer reads this from the args, not from its own constructor
    load_best_model_at_end=True,
    bf16=True,
)
trainer = Seq2SeqTrainer(
model=model,
args=training_args,
train_dataset=filtered_data["train"],
eval_dataset=filtered_data["test"],
tokenizer=tokenizer,
data_collator=data_collator,
compute_metrics=compute_metrics,
)
trainer.train()
I know that the tokenizer doesn't add the EOS token:
inputs = tokenizer(['Hello world', 'Hello'], padding=True, truncation=True, max_length=100, return_tensors="pt")
labels = inputs["input_ids"]
print(labels)
print(tokenizer.convert_tokens_to_ids(['<s>'])[0])
print(tokenizer.convert_tokens_to_ids(['<pad>'])[0])
print(tokenizer.convert_tokens_to_ids(['<unk>'])[0])
print(tokenizer.convert_tokens_to_ids(['</s>'])[0])
print(tokenizer.convert_ids_to_tokens([1]))
Output:
tensor([[1, 10356, 1, 5056],
[1, 10356, 16002, 16002]])
16000
16002
0
16001
['▁']
(I don't really understand what that strange token with index 1 is.)
Anyway, I was wondering whether the Trainer class or the DataCollator actually adds the EOS token. I did not find any examples online of how and where to add it.
I suspect it's not being added, because after training, the model doesn't stop generating until it hits max_new_tokens (which is set pretty high).
What's the best practice here? Where should I add EOS? Is there anything else about this code that should be checked or that looks weird to more experienced eyes?
The T5 tokenizer should end sequences with the EOS token by default; the pretrained T5 tokenizer on Hugging Face does this out of the box. In fact, I found the function responsible for it in the source code of the (slow) T5 tokenizer, on line 256:
def _add_eos_if_not_present(self, token_ids: List[int]) -> List[int]:
    """Do not add eos again if user already added it."""
    if len(token_ids) > 0 and token_ids[-1] == self.eos_token_id:
        warnings.warn(
            f"This sequence already has {self.eos_token}. In future versions this behavior may lead to duplicated"
            " eos tokens being added."
        )
        return token_ids
    else:
        return token_ids + [self.eos_token_id]
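You can verify this with a stock pretrained checkpoint (illustrative check; the exact ids depend on the checkpoint, but the last id is always the EOS id):
from transformers import AutoTokenizer

# The pretrained fast tokenizer ships with a post-processor that appends </s>.
pretrained = AutoTokenizer.from_pretrained("t5-small")
ids = pretrained("Hello world").input_ids
print(ids[-1] == pretrained.eos_token_id)  # True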
If the EOS token is not appended by default (as with your custom-trained fast tokenizer, whose tokenizer.json has no post-processor), you can add one to your tokenizer using TemplateProcessing:
from tokenizers.processors import TemplateProcessing
tokenizer._tokenizer.post_processor = TemplateProcessing(
single="$A </s>",
pair="$A </s> $B </s>",
special_tokens=[("</s>", tokenizer.eos_token_id)]
)
inputs = tokenizer(['Hello world', 'Hello'], padding=True, truncation=True, max_length=100, return_tensors="pt")
labels = inputs["input_ids"]
print(labels)
This should give:
tensor([[1, 10356, 1, 5056, 16001],
[1, 10356, 16001, 16002, 16002]])
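One caveat: DataCollatorForSeq2Seq only pads the labels (with -100); it does not append EOS itself. So after attaching the post-processor, the dataset has to be tokenized again. A quick spot check, reusing the names from the question:
# Re-run the preprocessing so the cached "labels" also end with </s>.
retokenized = dataset.map(preprocess_function, batched=True, remove_columns=dataset["train"].column_names)
sample = retokenized["train"][0]
print(sample["input_ids"][-1] == tokenizer.eos_token_id)  # should now be True
print(sample["labels"][-1] == tokenizer.eos_token_id)     # should now be True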