huggingface-transformers, bert-language-model, huggingface-tokenizers

Truncating a training dataset so that it fits exactly within the context window


I have a dataset that comes to around 5000 tokens once tokenised. I want to feed it into a BERT-style model, so I have to shrink it down to 512 tokens, but I also want to rearrange the text to train on fill-in-the-middle tasks using the techniques outlined in this paper: https://arxiv.org/abs/2207.14255

My issue is that I want to take the last 512 - 1 tokens and prepend my <PRE> token to the beginning, but I'm finding it difficult to simply prepend a single token to my tokenised text without going through the process of encoding the text to tokens, truncating it on the left, decoding it back to text, adding my <PRE> token, and then re-encoding it again. Is there a simpler way?

Here's what I have so far:

from transformers import AutoTokenizer

additional_special_tokens = ["<PRE>", "<SUF>", "<MID>"]

tokenizer = AutoTokenizer.from_pretrained(model_name, truncation_side="left")
tokenizer.additional_special_tokens = additional_special_tokens

small_eval_dataset = full_dataset["validation"].shuffle(42).select(range(1))

def build_training_data(examples):
    to_tokenized = examples["context"] + "<SUF><MID>" + examples["gt"]
    tokenized = tokenizer(to_tokenized, truncation=True)
    tokenized["input_ids"][0] = tokenizer("<PRE>")
    return tokenized


small_eval_dataset = small_eval_dataset.map(build_training_data)

I would like the text truncated on the left to 512 tokens, so that I can feed it into my BERT-style model and have it train on this specific task.


Solution

  • First of all, to add special tokens to your tokenizer, you should use the add_tokens method. Simply setting tokenizer.additional_special_tokens has no effect.

    tokenizer = AutoTokenizer.from_pretrained(model_name, truncation_side="left")
    tokenizer.add_tokens(additional_special_tokens, special_tokens=True)
    
    print(tokenizer.added_tokens_encoder)
    >>> {'[PAD]': 0, '[UNK]': 100, '[CLS]': 101, '[SEP]': 102, '[MASK]': 103, '<PRE>': 28996, '<SUF>': 28997, '<MID>': 28998}
    

    After doing this, you should also resize the model's token embeddings, i.e. initialize random embeddings for the new tokens we just added (source: How to add new special token to the tokenizer?):

    model.resize_token_embeddings(len(tokenizer))
    
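    For completeness, here is a minimal sketch of how the model side fits together, assuming a BERT-style masked-LM checkpoint loaded with AutoModelForMaskedLM (the model_name value below is just a placeholder; use whatever checkpoint you are actually fine-tuning):

    from transformers import AutoModelForMaskedLM, AutoTokenizer

    model_name = "bert-base-cased"  # placeholder checkpoint
    additional_special_tokens = ["<PRE>", "<SUF>", "<MID>"]

    tokenizer = AutoTokenizer.from_pretrained(model_name, truncation_side="left")
    tokenizer.add_tokens(additional_special_tokens, special_tokens=True)

    model = AutoModelForMaskedLM.from_pretrained(model_name)
    # the embedding matrix must grow to cover the newly added token ids,
    # otherwise the new ids would index past the end of the embedding table
    model.resize_token_embeddings(len(tokenizer))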

    Note that after training the model, it's a good idea to save the tokenizer using the tokenizer.save_pretrained method, so that you can easily load it later.
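    For example (the directory path below is just illustrative):

    save_dir = "./fim-bert"
    tokenizer.save_pretrained(save_dir)  # writes the tokenizer files, including the added special tokens
    model.save_pretrained(save_dir)      # optionally save the resized model alongside it

    # later, both can be restored from the same directory
    tokenizer = AutoTokenizer.from_pretrained(save_dir)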

  • Second, to prepend the <PRE> token to the tokenized text, you can directly modify the tokenized object:

    def build_training_example(text):

        # id of the custom <PRE> token added to the tokenizer above
        pre_token_id = tokenizer.get_vocab()['<PRE>']

        tokenized = tokenizer(text)
        # keep only the last 511 token ids and prepend <PRE>, for 512 tokens in total
        tokenized['input_ids'] = [pre_token_id] + tokenized['input_ids'][-511:]
        tokenized['attention_mask'] = [1] + tokenized['attention_mask'][-511:]

        return tokenized
    

    This function keeps only the last 511 token ids (truncating on the left) and prepends the id of the <PRE> token.

    I also added re-assigning the attention_mask; this might be useful if you have examples shorter than 512 tokens and use padding to extend them to length 512, but you may or may not need it depending on how you plan to use the tokenized data. You may also consider setting add_special_tokens=False in the tokenizer call: if you want your data to always start with the <PRE> token, it's a good idea to avoid the [CLS] token, which tokenization otherwise starts with when add_special_tokens is True.
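    Putting it together, here is a sketch of how you might apply this to your dataset with datasets.map, using the "context" and "gt" columns from your question and the add_special_tokens=False variant discussed above (adapt the column names and the 511-token cut-off to your setup):

    pre_token_id = tokenizer.get_vocab()["<PRE>"]

    def build_training_data(examples):
        text = examples["context"] + "<SUF><MID>" + examples["gt"]
        # skip [CLS]/[SEP] so the sequence really starts with <PRE>
        tokenized = tokenizer(text, add_special_tokens=False)
        tokenized["input_ids"] = [pre_token_id] + tokenized["input_ids"][-511:]
        tokenized["attention_mask"] = [1] + tokenized["attention_mask"][-511:]
        return tokenized

    small_eval_dataset = small_eval_dataset.map(build_training_data)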