Tags: machine-learning, nlp, huggingface-transformers, huggingface

What is the correct approach to evaluate Huggingface models on the masked language modeling task?


I'm trying to test how well different models are doing on the masked language modeling task.

Given a prompt

prompt = "The Milky Way is a [MASK] galaxy"

I'm trying to get an output for the masked token from different models. The issue is that when I load a model for the masked language modeling task:

from transformers import AutoModelForMaskedLM, AutoTokenizer
model = AutoModelForMaskedLM.from_pretrained('bert-base-cased')
model.eval()
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased', truncation=True)

I get the warning:

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'bert.pooler.dense.bias', 'cls.seq_relationship.weight', 'bert.pooler.dense.weight']

A similar question on the Hugging Face forum only referred to a subset of these weights: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']. The answer there was:

    It tells you that by loading the bert-base-uncased checkpoint in the BertForMaskedLM architecture, you're dropping two weights: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']. These are the weights used for next-sentence prediction, which aren't necessary for Masked Language Modeling. If you're only interested in doing masked language modeling, then you can safely disregard this warning.

But it seems that when I load the model this way I'm also dropping ['bert.pooler.dense.bias', 'bert.pooler.dense.weight'], and I could not find out whether dropping these weights changes masked language modeling performance when the model is not fine-tuned.

If I just load the model with

model = AutoModel.from_pretrained('bert-base-cased')

I get no warning, but then I cannot use it to predict the masked token (as far as I know).

So is the correct approach to load the model with AutoModelForMaskedLM as I've done and just ignore the warning (assuming that it makes no difference because the dropped weights are not used for the masked language modeling task), or is there a different approach?


Solution

  • As pointed out in the linked post, this is a warning indicating that those weights are not used. It is raised because you're loading a checkpoint whose pooler weights are initialised (bert-base-cased), but a *MaskedLM model doesn't use them. The bert.pooler.* weights are typically used for classification tasks (such as BertForSequenceClassification). Since bert-base-cased was also trained on the Next Sentence Prediction (NSP) task, its pooler weights are trained as well. As pointed out in this GitHub comment:

    After passing a sentence through the model, the representation corresponding to the first token in the output is used for fine-tuning on tasks like SQuAD and GLUE. So the pooler layer does precisely that, applies a linear transformation over the representation of the first token. The linear transformation is trained while using the Next Sentence Prediction (NSP) strategy.
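
    To make the quoted description concrete, here is a minimal sketch (not part of the original answer) showing that the pooler output is just a dense layer followed by a tanh applied to the representation of the first ([CLS]) token; it uses the pooler attribute exposed by the loaded BERT model:

    import torch
    from transformers import AutoModel, AutoTokenizer
    
    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    model = AutoModel.from_pretrained("bert-base-cased")
    model.eval()
    
    inputs = tokenizer("The Milky Way is a spiral galaxy.", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    
    # The pooler applies a linear layer + tanh to the first-token representation
    first_token = outputs.last_hidden_state[:, 0]
    manual_pooled = torch.tanh(model.pooler.dense(first_token))
    
    # Matches the pooler_output returned by the model
    print(torch.allclose(manual_pooled, outputs.pooler_output))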

    In fact, whenever the model is initialised for the Next Sentence Prediction task, the warning isn't raised:

    from transformers import AutoModelForMaskedLM, AutoModelForNextSentencePrediction, AutoModelForPreTraining
    
    # raises a warning
    AutoModelForMaskedLM.from_pretrained("bert-base-cased")
    
    # doesn't raise a warning
    AutoModelForNextSentencePrediction.from_pretrained("bert-base-cased")
    
    # doesn't raise a warning; initialised with both MLM & NSP
    AutoModelForPreTraining.from_pretrained("bert-base-cased")
    

    Since you're interested in Masked Language Modelling (MLM), you can disregard the warning, as the pooler isn't used for this task. For masked language modelling you should initialise the model with AutoModelForMaskedLM, since this includes the appropriate head to predict the masked token (a minimal prediction sketch is shown below). This forum post has further details about the differences in initialisations.
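
    As a rough sketch of that approach (the top-5 cutoff is an arbitrary choice), the masked token in the prompt from the question can be predicted like this:

    import torch
    from transformers import AutoModelForMaskedLM, AutoTokenizer
    
    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")
    model.eval()
    
    prompt = "The Milky Way is a [MASK] galaxy"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    
    # Locate the [MASK] position and take the top-5 token predictions for it
    mask_positions = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
    top_ids = logits[0, mask_positions[0]].topk(5).indices.tolist()
    print(tokenizer.convert_ids_to_tokens(top_ids))

    The fill-mask pipeline from transformers wraps the same steps and can be used instead if you prefer a one-liner per prompt.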