machine-learning, huggingface-transformers, huggingface, huggingface-tokenizers, huggingface-trainer

How to add EOS when training T5?


I'm a little puzzled about where (and whether) EOS tokens are added when using Hugging Face's trainer classes to train a T5 (LongT5, actually) model.

The data set contains pairs of text like this:

from               to
some text          some corresponding text
some other text    some other corresponding text

The tokenizer has been custom trained:

from tokenizers import SentencePieceUnigramTokenizer

tokenizer = SentencePieceUnigramTokenizer()
tokenizer.train_from_iterator(iterator=iterator, vocab_size=32_128, show_progress=True, unk_token="<unk>")

and is loaded like this:

from transformers import T5TokenizerFast

tokenizer = T5TokenizerFast(tokenizer_file="data-rb-25000/tokenizer.json",
                            padding=True, bos_token="<s>",
                            eos_token="</s>", unk_token="<unk>",
                            pad_token="<pad>")

Before training, the data set is tokenized, and examples whose token count is too high are filtered out, like so:

MAX_SEQUENCE_LENGTH = 16_384 / 2

def preprocess_function(examples):
    inputs = tokenizer(
        examples['from'],
        truncation=False,  # Don't truncate yet
        padding=False,     # Don't pad yet
        return_length=True,
    )
    labels = tokenizer(
        examples['to'],
        truncation=False,
        padding=False,
        return_length=True,
    )

    inputs["input_length"] = inputs["length"]
    inputs["labels"] = labels["input_ids"]
    inputs["label_length"] = labels["length"]

    inputs.pop("length", None)

    return inputs

tokenized_data = dataset.map(preprocess_function, batched=True, remove_columns=dataset["train"].column_names)

def filter_function(example):
    return example['input_length'] <= MAX_SEQUENCE_LENGTH and example['label_length'] <= MAX_SEQUENCE_LENGTH

filtered_data = tokenized_data.filter(filter_function)

Training is done like this:

from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model="google/long-t5-tglobal-base")

from transformers import AutoModelForSeq2SeqLM, AutoConfig

config = AutoConfig.from_pretrained(
    "google/long-t5-tglobal-base",
    vocab_size=len(tokenizer),
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
    decoder_start_token_id=tokenizer.pad_token_id,
)

model = AutoModelForSeq2SeqLM.from_config(config)

from transformers import GenerationConfig

generation_config = GenerationConfig.from_model_config(model.config)
generation_config._from_model_config = False
generation_config.max_new_tokens = 16_384

from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="rb-25000-model",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=16,
    gradient_checkpointing=True,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=5,
    logging_steps=1,
    predict_with_generate=True,
    load_best_model_at_end=True,
    bf16=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=filtered_data["train"],
    eval_dataset=filtered_data["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    generation_config=generation_config,
)

trainer.train()

I know that the tokenizer doesn't add the EOS token:

inputs = tokenizer(['Hello world', 'Hello'], padding=True, truncation=True, max_length=100, return_tensors="pt")
labels = inputs["input_ids"]

print(labels)
print(tokenizer.convert_tokens_to_ids(['<s>'])[0])
print(tokenizer.convert_tokens_to_ids(['<pad>'])[0])
print(tokenizer.convert_tokens_to_ids(['<unk>'])[0])
print(tokenizer.convert_tokens_to_ids(['</s>'])[0])

print(tokenizer.convert_ids_to_tokens([1]))

Output:

tensor([[1, 10356, 1, 5056],
        [1, 10356, 16002, 16002]])
16000
16002
0
16001
['▁']

(I don't really understand what that strange token with index 1 is.)

Anyway, I was wondering whether the Trainer class or the DataCollator actually adds the EOS token. I did not find any examples online of how and where to add it.

I suspect it's not there, because after training, the model doesn't stop generating until it reaches max_new_tokens (which is set quite high).
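
One way to check this directly (a quick sketch using the tokenizer and filtered_data objects defined above) is to look at the last tokens of a preprocessed example:

# Do the preprocessed inputs/labels end with </s>?
sample = filtered_data["train"][0]
print(tokenizer.convert_ids_to_tokens(sample["input_ids"][-3:]))
print(tokenizer.convert_ids_to_tokens(sample["labels"][-3:]))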

What's the best practice here? Where should I add EOS? Is there anything else about this code that should be checked, or that looks weird to more experienced eyes?


Solution

  • The T5 tokenizer should end each sequence with the EOS token by default; the pretrained T5 tokenizer on Hugging Face does this. In fact, I found the function responsible for it in the source code, at line 256:

    def _add_eos_if_not_present(self, token_ids: List[int]) -> List[int]:
        """Do not add eos again if user already added it."""
        if len(token_ids) > 0 and token_ids[-1] == self.eos_token_id:
            warnings.warn(
                f"This sequence already has {self.eos_token}. In future versions this behavior may lead to duplicated"
                " eos tokens being added."
            )
            return token_ids
        else:
            return token_ids + [self.eos_token_id]
    

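    You can quickly verify this behavior with a stock pretrained tokenizer (a minimal check; t5-small is used here only as an example checkpoint, and its </s> id is 1):

    from transformers import AutoTokenizer

    pretrained_tok = AutoTokenizer.from_pretrained("t5-small")
    print(pretrained_tok("Hello world").input_ids)  # ends with 1, the </s> id
    print(pretrained_tok.eos_token_id)              # 1
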
    If the EOS token is not appended by default, you can add a post processor to your tokenizer using TemplateProcessing:

    from tokenizers.processors import TemplateProcessing
    
    tokenizer._tokenizer.post_processor = TemplateProcessing(
        single="$A </s>",
        pair="$A </s> $B </s>",
        special_tokens=[("</s>", tokenizer.eos_token_id)]
    )
    
    inputs = tokenizer(['Hello world', 'Hello'], padding=True, truncation=True, max_length=100, return_tensors="pt")
    labels = inputs["input_ids"]
    print(labels)
    

    This should give:

    tensor([[1, 10356, 1, 5056, 16001],
            [1, 10356, 16001, 16002, 16002]])
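
    Note that the post processor has to be set before the dataset is tokenized with dataset.map; otherwise the already-mapped examples will still be missing EOS. As an alternative, you can append the EOS id yourself during preprocessing. Here is a minimal sketch (the helper name preprocess_with_eos is made up; it reuses the tokenizer from the question):

    def preprocess_with_eos(examples):
        inputs = tokenizer(examples["from"], truncation=False, padding=False)
        labels = tokenizer(examples["to"], truncation=False, padding=False)
        eos = tokenizer.eos_token_id
        # Append </s> to the encoder inputs (keeping the attention mask in sync)
        # and to the labels, so the model can learn when to stop generating.
        inputs["input_ids"] = [ids + [eos] for ids in inputs["input_ids"]]
        inputs["attention_mask"] = [mask + [1] for mask in inputs["attention_mask"]]
        inputs["labels"] = [ids + [eos] for ids in labels["input_ids"]]
        return inputs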