I'm a little puzzled about where (and whether) EOS tokens are added when using Hugging Face's trainer classes to train a T5 (LongT5, actually) model.
The data set contains pairs of text like this:
| from | to |
| --- | --- |
| some text | some corresponding text |
| some other text | some other corresponding text |
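The pairs end up in a regular datasets DatasetDict with "train" and "test" splits. A simplified sketch of the loading (not the actual code; the column names match what preprocess_function below expects):
from datasets import Dataset

# Simplified stand-in for the real data loading.
pairs = {
    "from": ["some text", "some other text"],
    "to": ["some corresponding text", "some other corresponding text"],
}
dataset = Dataset.from_dict(pairs).train_test_split(test_size=0.5)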
The tokenizer has been custom trained:
from tokenizers import SentencePieceUnigramTokenizer

tokenizer = SentencePieceUnigramTokenizer()
tokenizer.train_from_iterator(iterator=iterator, vocab_size=32_128, show_progress=True, unk_token="<unk>")
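The trained tokenizer is then written to disk (presumably something like this; the path is the one loaded below):
# Produces the tokenizer.json file that T5TokenizerFast reads.
tokenizer.save("data-rb-25000/tokenizer.json")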
and is loaded like this:
from transformers import T5TokenizerFast

tokenizer = T5TokenizerFast(
    tokenizer_file="data-rb-25000/tokenizer.json",
    padding=True,
    bos_token="<s>",
    eos_token="</s>",
    unk_token="<unk>",
    pad_token="<pad>",
)
Before training, the data set is tokenized, and examples whose token count is too high are filtered out, like so:
MAX_SEQUENCE_LENGTH = 16_384 // 2
def preprocess_function(examples):
inputs = tokenizer(
examples['from'],
truncation=False, # Don't truncate yet
padding=False, # Don't pad yet
return_length=True,
)
labels = tokenizer(
examples['to'],
truncation=False,
padding=False,
return_length=True,
)
inputs["input_length"] = inputs["length"]
inputs["labels"] = labels["input_ids"]
inputs["label_length"] = labels["length"]
inputs.pop("length", None)
return inputs
tokenized_data = dataset.map(preprocess_function, batched=True, remove_columns=dataset["train"].column_names)
def filter_function(example):
return example['input_length'] <= MAX_SEQUENCE_LENGTH and example['label_length'] <= MAX_SEQUENCE_LENGTH
filtered_data = tokenized_data.filter(filter_function)
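A quick way to see what actually ends up in the processed dataset is to look at the last ids of a sample (sketch, using the column names produced by preprocess_function):
sample = filtered_data["train"][0]
print(sample["input_ids"][-1] == tokenizer.eos_token_id)  # is EOS the last input id?
print(sample["labels"][-1] == tokenizer.eos_token_id)     # is EOS the last label id?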
Training is done like this:
from transformers import DataCollatorForSeq2Seq
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model="google/long-t5-tglobal-base")
from transformers import AutoModelForSeq2SeqLM, AutoConfig
config = AutoConfig.from_pretrained(
"google/long-t5-tglobal-base",
vocab_size=len(tokenizer),
pad_token_id=tokenizer.pad_token_id,
eos_token_id=tokenizer.eos_token_id,
decoder_start_token_id=tokenizer.pad_token_id,
)
model = AutoModelForSeq2SeqLM.from_config(config)
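A small sanity check that the freshly initialized model picked up the custom vocabulary and special token ids (sketch):
print(model.config.vocab_size == len(tokenizer))
print(model.config.eos_token_id == tokenizer.eos_token_id)
print(model.config.decoder_start_token_id == tokenizer.pad_token_id)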
from transformers import GenerationConfig
generation_config = GenerationConfig.from_model_config(model.config)
generation_config._from_model_config = False
generation_config.max_new_tokens = 16_384
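The EOS and pad ids should have been copied from the model config into the generation config; generation can only stop early if eos_token_id is set there. A quick check:
# Both should print the ids set on the model config above, not None.
print(generation_config.eos_token_id, generation_config.pad_token_id)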
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments
training_args = Seq2SeqTrainingArguments(
output_dir="rb-25000-model",
eval_strategy="epoch",
save_strategy="epoch",
learning_rate=2e-5,
per_device_train_batch_size=1,
per_device_eval_batch_size=1,
gradient_accumulation_steps=16,
gradient_checkpointing=True,
weight_decay=0.01,
save_total_limit=3,
num_train_epochs=5,
logging_steps=1,
    predict_with_generate=True,
    generation_config=generation_config,  # Seq2SeqTrainer reads this from the args, not from its own constructor
    load_best_model_at_end=True,
    bf16=True,
)
trainer = Seq2SeqTrainer(
model=model,
args=training_args,
train_dataset=filtered_data["train"],
eval_dataset=filtered_data["test"],
tokenizer=tokenizer,
data_collator=data_collator,
compute_metrics=compute_metrics,
)
trainer.train()
I know that the tokenizer doesn't add the EOS token:
inputs = tokenizer(['Hello world', 'Hello'], padding=True, truncation=True, max_length=100, return_tensors="pt")
labels = inputs["input_ids"]
print(labels)
print(tokenizer.convert_tokens_to_ids(['<s>'])[0])
print(tokenizer.convert_tokens_to_ids(['<pad>'])[0])
print(tokenizer.convert_tokens_to_ids(['<unk>'])[0])
print(tokenizer.convert_tokens_to_ids(['</s>'])[0])
print(tokenizer.convert_ids_to_tokens([1]))
Output:
tensor([[1, 10356, 1, 5056],
[1, 10356, 16002, 16002]])
16000
16002
0
16001
['▁']
(I don't really understand what that strange token with index 1 is.)
Anyway, I was wondering whether the Trainer class or the DataCollator actually adds the EOS token. I did not find any examples online of how and where to add it.
I suspect it's not being added, because after training, the model doesn't stop generating until it hits max_new_tokens (which is set pretty high).
What's the best practice here? Where should I add EOS? Is there anything else about this code that should be checked or that looks weird to more experienced eyes?
The T5 tokenizer should end sequences with the EOS token by default; the pretrained T5 tokenizer on Hugging Face does this out of the box. In fact, I found the function responsible for it in the source code of the (slow) T5 tokenizer, on line 256:
def _add_eos_if_not_present(self, token_ids: List[int]) -> List[int]:
    """Do not add eos again if user already added it."""
    if len(token_ids) > 0 and token_ids[-1] == self.eos_token_id:
        warnings.warn(
            f"This sequence already has {self.eos_token}. In future versions this behavior may lead to duplicated"
            " eos tokens being added."
        )
        return token_ids
    else:
        return token_ids + [self.eos_token_id]
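You can verify this with a stock pretrained checkpoint (illustrative check; the exact ids depend on the checkpoint, but the last id is always the EOS id):
from transformers import AutoTokenizer

# The pretrained fast tokenizer ships with a post-processor that appends </s>.
pretrained = AutoTokenizer.from_pretrained("t5-small")
ids = pretrained("Hello world").input_ids
print(ids[-1] == pretrained.eos_token_id)  # True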
If the EOS token is not appended by default (as with your custom-trained fast tokenizer, whose tokenizer.json has no post-processor), you can add one to your tokenizer using TemplateProcessing:
from tokenizers.processors import TemplateProcessing
tokenizer._tokenizer.post_processor = TemplateProcessing(
single="$A </s>",
pair="$A </s> $B </s>",
special_tokens=[("</s>", tokenizer.eos_token_id)]
)
inputs = tokenizer(['Hello world', 'Hello'], padding=True, truncation=True, max_length=100, return_tensors="pt")
labels = inputs["input_ids"]
print(labels)
This should give:
tensor([[1, 10356, 1, 5056, 16001],
[1, 10356, 16001, 16002, 16002]])
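One caveat: DataCollatorForSeq2Seq only pads the labels (with -100); it does not append EOS itself. So after attaching the post-processor, the dataset has to be tokenized again. A quick spot check, reusing the names from the question:
# Re-run the preprocessing so the cached "labels" also end with </s>.
retokenized = dataset.map(preprocess_function, batched=True, remove_columns=dataset["train"].column_names)
sample = retokenized["train"][0]
print(sample["input_ids"][-1] == tokenizer.eos_token_id)  # should now be True
print(sample["labels"][-1] == tokenizer.eos_token_id)     # should now be True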