pytorch, huggingface-tokenizers

How to Suppress "Using bos_token, but it is not set yet..." in HuggingFace T5 Tokenizer


I'd like to turn off the warnings that Hugging Face generates when I use `unique_no_split_tokens`:

In[1]   from transformers import T5Tokenizer
In[2]   tokenizer = T5Tokenizer.from_pretrained("t5-base")
In[3]   tokenizer(" ".join([f"<extra_id_{n}>" for n in range(1,101)]), return_tensors="pt").input_ids.size()
Out[3]: torch.Size([1, 100])
    Using bos_token, but it is not set yet.
    Using cls_token, but it is not set yet.
    Using mask_token, but it is not set yet.
    Using sep_token, but it is not set yet.

Anyone know how to do this?


Solution

  • This solution worked for me:

    # Registering the tokens as special tokens stops the tokenizer
    # from warning about them when they are encoded.
    tokenizer.add_tokens([f"_{n}" for n in range(1, 100)], special_tokens=True)
    # `model` is the T5 model you use with this tokenizer; its embedding
    # matrix must grow to match the extended vocabulary.
    model.resize_token_embeddings(len(tokenizer))
    tokenizer.save_pretrained('pathToExtendedTokenizer/')
    tokenizer = T5Tokenizer.from_pretrained("pathToExtendedTokenizer/")
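
  • If you only want to silence the messages rather than modify the tokenizer, another option (my own suggestion, not part of the answer above) is to raise the level of the `transformers` logger. These messages go through Python's standard `logging` module under the `transformers` logger namespace; the exact severity varies by library version, so using `CRITICAL` also covers versions that emit them at `ERROR` level. A stdlib-only sketch of the mechanism:

    ```python
    import logging

    # Raise the level of the whole "transformers" logger namespace.
    # CRITICAL filters out both WARNING- and ERROR-level messages,
    # including "Using bos_token, but it is not set yet."
    logging.getLogger("transformers").setLevel(logging.CRITICAL)

    # Demonstration of the mechanism: a child logger inherits the
    # parent's effective level, so lower-severity records are dropped.
    child = logging.getLogger("transformers.tokenization_utils")
    child.propagate = False  # keep this demo's output off stderr

    captured = []
    handler = logging.Handler()
    handler.emit = captured.append  # record anything that passes the filter
    child.addHandler(handler)

    child.error("Using bos_token, but it is not set yet.")  # filtered out
    child.critical("critical messages still get through")    # kept

    print([r.getMessage() for r in captured])
    # → ['critical messages still get through']
    ```

    Note this hides every `transformers` message below `CRITICAL`, not just the token warnings, so the `add_tokens` approach above is the more targeted fix.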