python · nlp · pytorch · huggingface-transformers · bert-language-model

BERT Domain Adaptation


I am using transformers.BertForMaskedLM to further pre-train the BERT model on my custom dataset. First, I serialize all the text to a .txt file, separating the words by whitespace. Then I load the serialized data with transformers.TextDataset, passing a BERT tokenizer as the tokenizer argument. Next, I load the pre-trained model (as provided by the transformers library) with BertForMaskedLM.from_pretrained() and use transformers.Trainer to further pre-train it on my custom dataset, i.e., domain adaptation, for 3 epochs. I save the resulting model with trainer.save_model(). Finally, I want to load the further pre-trained model to get the embeddings of the words in my custom dataset. To load the model, I use AutoModel.from_pretrained(), but this pops up a warning.

Some weights of the model checkpoint at {path to my further pre-trained model} were not used when initializing BertModel

So, I know why this pops up: I further pre-trained using transformers.BertForMaskedLM, but when I load with transformers.AutoModel, it is loaded as transformers.BertModel. What I do not understand is whether this is a problem or not. I just want to get the embeddings, e.g., an embedding vector of size 768.
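A rough sketch of the pipeline I described looks like the following (the corpus path, block size, and training hyperparameters are placeholders, not my exact values):

    from transformers import (
        BertTokenizer,
        BertForMaskedLM,
        TextDataset,
        DataCollatorForLanguageModeling,
        Trainer,
        TrainingArguments,
    )

    tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
    model = BertForMaskedLM.from_pretrained('bert-base-cased')

    # load the whitespace-separated .txt corpus and chunk it into blocks
    dataset = TextDataset(
        tokenizer=tokenizer,
        file_path='my_corpus.txt',   # placeholder path to the serialized text
        block_size=128,
    )

    # randomly mask tokens for the masked language modeling objective
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=True, mlm_probability=0.15
    )

    training_args = TrainingArguments(
        output_dir='bert-domain-adapted',  # placeholder output directory
        num_train_epochs=3,
        per_device_train_batch_size=16,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        data_collator=data_collator,
        train_dataset=dataset,
    )
    trainer.train()
    trainer.save_model('bert-domain-adapted')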


Solution

  • You saved a BERT model with an LM head attached. Now you are loading that checkpoint into a standalone BERT model without the extra head, so the warning is issued. This is perfectly normal, and nothing fatal happens. You can check the list of parameters that were not loaded like this:

    from transformers import BertConfig, BertLMHeadModel, BertModel

    # load the base checkpoint with the LM head attached and save it to disk
    config = BertConfig.from_pretrained('bert-base-cased')
    lmbert = BertLMHeadModel.from_pretrained('bert-base-cased', config=config)
    lmbert.save_pretrained('your_desired_path/BertLMHeadModel')

    # parameter names of the model with the LM head
    lmbert_params = []
    for name, param in lmbert.named_parameters():
        lmbert_params.append(name)

    # reload the same checkpoint into a plain BertModel (this triggers the warning)
    bert = BertModel.from_pretrained('your_desired_path/BertLMHeadModel')

    # parameter names of the headless model
    bert_params = []
    for name, param in bert.named_parameters():
        bert_params.append(name)

    # every parameter that exists only because of the LM head
    params_related_to_lm_head = [
        param_name for param_name in lmbert_params
        if param_name.replace('bert.', '') not in bert_params
    ]
    params_related_to_lm_head
    
    

    output:

    ['cls.predictions.bias',
     'cls.predictions.transform.dense.weight',
     'cls.predictions.transform.dense.bias',
     'cls.predictions.transform.LayerNorm.weight',
     'cls.predictions.transform.LayerNorm.bias']
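
  • If all you need are 768-dimensional embeddings, loading the further pre-trained checkpoint with AutoModel and taking the hidden states is enough. A minimal sketch (the checkpoint path and input sentence are placeholders, and it assumes the tokenizer was saved alongside the model):

    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained('path/to/further-pretrained-model')
    model = AutoModel.from_pretrained('path/to/further-pretrained-model')  # the warning above is expected here
    model.eval()

    inputs = tokenizer('an example sentence from the custom domain', return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs)

    token_embeddings = outputs.last_hidden_state  # shape: (1, seq_len, 768)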