python · nlp · tokenize · bert-language-model · sentence-transformers

bert-base-uncased does not use newly added suffix token


I want to add custom tokens to the BertTokenizer. However, the tokenizer does not use the new token.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenizer.add_tokens('##oldert')

text = "DocumentOlderThan"
tokens = tokenizer.tokenize(text)
print(tokens)

Output is:

['document', '##old', '##ert', '##han']

But I would expect:

['document', '##oldert', '##han']

How can I make the tokenizer use the new token instead of multiple old ones?


Solution

  • You need to update the tokenizer's WordPiece vocabulary directly. add_tokens registers tokens that are split off and matched as whole words before WordPiece runs, so a ##-prefixed continuation piece added that way is never used when a longer word is broken into subwords. Adding the piece to the vocabulary itself does what you want:

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    added_tokens = ['##oldert']

    # Add each new token to the WordPiece vocabulary under the next free id
    tokenizer.vocab.update(
        {token: len(tokenizer.vocab) + i for i, token in enumerate(added_tokens)}
    )
    # Rebuild the reverse mapping so the new ids can be decoded back to tokens
    tokenizer.ids_to_tokens.update({v: k for k, v in tokenizer.vocab.items()})

    text = "DocumentOlderThan"
    tokens = tokenizer.tokenize(text)
    print(tokens)
    

    Which results in: ['document', '##oldert', '##han']
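
  • Keep in mind that the tokenizer's vocabulary and the model's embedding matrix must stay in sync: the new id has no embedding row in a pretrained BERT model until you resize it. Below is a minimal sketch of that follow-up step, assuming you load the model as BertModel; resize_token_embeddings is standard transformers API, and the rest mirrors the snippet above.

    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    tokenizer.vocab.update({'##oldert': len(tokenizer.vocab)})
    tokenizer.ids_to_tokens.update({v: k for k, v in tokenizer.vocab.items()})

    # Check that the new piece maps to a valid id and back
    new_id = tokenizer.convert_tokens_to_ids('##oldert')
    print(new_id, tokenizer.convert_ids_to_tokens(new_id))

    model = BertModel.from_pretrained("bert-base-uncased")
    # Grow the embedding matrix so the new id has a (randomly initialised) row
    model.resize_token_embeddings(len(tokenizer))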