Tags: nlp, bert-language-model, huggingface-transformers, huggingface-tokenizers

BERT - Is it necessary to add new tokens when training in a domain-specific environment?


My question here is not how to add new tokens, or how to train using a domain-specific corpus; I'm already doing that.

The thing is, am I supposed to add the domain-specific tokens before the MLM training, or should I just let BERT figure out the context? If I choose not to include the tokens, am I going to get a poor task-specific model, e.g. for NER?

To give you more background on my situation, I'm training a BERT model on medical text in Portuguese, so disease names, drug names, and other domain terms are present in my corpus, but I'm not sure whether I have to add those tokens before the training.
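
To make the concern concrete, here is a minimal sketch (the checkpoint and the term are just placeholders for illustration, not my actual setup) showing how a stock WordPiece vocabulary breaks an unseen domain term into subword pieces:

    from transformers import BertTokenizer

    # Placeholder checkpoint; substitute the pretrained model you start from.
    tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')

    # A domain term that is not in the stock vocabulary gets split into
    # several '##' subword pieces instead of being kept as a single token.
    print(tokenizer.tokenize('dipirona'))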

I saw this one: Using Pretrained BERT model to add additional words that are not recognized by the model

But my doubts remain, as other sources say otherwise.

Thanks in advance.


Solution

  • Yes, you have to add them to the model's vocabulary.

    from transformers import BertTokenizer, BertForMaskedLM

    # model_name is the identifier of the pretrained checkpoint you start from
    tokenizer = BertTokenizer.from_pretrained(model_name)
    tokenizer.add_tokens(['new', 'rdemorais', 'blabla'])

    model = BertForMaskedLM.from_pretrained(model_name, return_dict=False)

    # Resize the embedding matrix to account for the newly added tokens
    model.resize_token_embeddings(len(tokenizer))
    

    The last line is important and needed: since you changed the number of tokens in the tokenizer's vocabulary, you also need to resize the model's embedding matrix correspondingly.
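
    As a quick sanity check (a minimal sketch; the checkpoint and tokens are placeholders, as above), you can verify that the added tokens are now single vocabulary entries and that the embedding matrix grew to match:

        from transformers import BertTokenizer, BertForMaskedLM

        model_name = 'bert-base-multilingual-cased'  # placeholder checkpoint

        tokenizer = BertTokenizer.from_pretrained(model_name)
        num_added = tokenizer.add_tokens(['new', 'rdemorais', 'blabla'])

        model = BertForMaskedLM.from_pretrained(model_name)
        model.resize_token_embeddings(len(tokenizer))

        # Each added token is now a single entry instead of several subword
        # pieces, and the embedding matrix has one row per vocabulary entry.
        print(num_added)                        # how many tokens were actually added
        print(tokenizer.tokenize('rdemorais'))  # ['rdemorais']
        print(model.get_input_embeddings().weight.shape[0] == len(tokenizer))  # True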