Tags: nlp, pytorch, huggingface-transformers, bert-language-model, transformer-model

How to calculate perplexity of a sentence using huggingface masked language models?


I have several masked language models (mainly BERT, RoBERTa, ALBERT, ELECTRA). I also have a dataset of sentences. How can I get the perplexity of each sentence?

The Hugging Face documentation here mentions that perplexity "is not well defined for masked language models like BERT", though I still see people calculate it somehow.
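For context, the perplexity those docs define applies to autoregressive models: the exponentiated average negative log-likelihood of a sequence,

$$\mathrm{PPL}(X) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p(x_i \mid x_{<i})\right).$$

A masked language model never produces this left-to-right factorization, which is why the metric is said to be not well defined for it.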

For example, in this SO question it was calculated with the following function:

import torch
import numpy as np

def score(model, tokenizer, sentence, mask_token_id=103):
    tensor_input = tokenizer.encode(sentence, return_tensors='pt')
    # one copy of the sentence per real token (excluding [CLS] and [SEP])
    repeat_input = tensor_input.repeat(tensor_input.size(-1) - 2, 1)
    # matrix with exactly one 1 per row, selecting a different token each time
    mask = torch.ones(tensor_input.size(-1) - 1).diag(1)[:-2]
    # mask one token in each copy of the sentence
    masked_input = repeat_input.masked_fill(mask == 1, 103)
    # compute the loss only at the masked positions (-100 is ignored)
    labels = repeat_input.masked_fill(masked_input != 103, -100)
    loss, _ = model(masked_input, masked_lm_labels=labels)
    return np.exp(loss.item())

score(model, tokenizer, '我爱你')  # "I love you"; returns 45.63794545581973
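To make the masking trick in that function concrete, here is a minimal sketch of what the repeat/diag lines construct (the token ids are illustrative: 101/102 are BERT's [CLS]/[SEP], and 7, 8, 9 stand in for word-piece ids):

import torch

tensor_input = torch.tensor([[101, 7, 8, 9, 102]])                # [CLS] a b c [SEP]
repeat_input = tensor_input.repeat(tensor_input.size(-1) - 2, 1)  # 3 copies, shape (3, 5)
mask = torch.ones(tensor_input.size(-1) - 1).diag(1)[:-2]
# mask == tensor([[0., 1., 0., 0., 0.],
#                 [0., 0., 1., 0., 0.],
#                 [0., 0., 0., 1., 0.]])
# Each row masks exactly one real token; [CLS] and [SEP] are never masked.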

However, when I try to use the code I get TypeError: forward() got an unexpected keyword argument 'masked_lm_labels'.

I tried it with a couple of my models:

from transformers import pipeline, BertForMaskedLM, AutoTokenizer, RobertaForMaskedLM, AlbertForMaskedLM, ElectraForMaskedLM
import torch

# 1)
tokenizer = AutoTokenizer.from_pretrained("bioformers/bioformer-cased-v1.0")
model = BertForMaskedLM.from_pretrained("bioformers/bioformer-cased-v1.0")

# 2)
tokenizer = AutoTokenizer.from_pretrained("sultan/BioM-ELECTRA-Large-Generator")
model = ElectraForMaskedLM.from_pretrained("sultan/BioM-ELECTRA-Large-Generator")

This SO question also used masked_lm_labels as an input, and it seemed to work somehow.


Solution

  • There is a paper, Masked Language Model Scoring, that explores pseudo-perplexity from masked language models and shows that pseudo-perplexity, while not theoretically well justified, still performs well for comparing the "naturalness" of texts.
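    In the paper's notation (roughly): each token $w_i$ of a sentence $W$ is masked in turn, the pseudo-log-likelihood is

    $$\mathrm{PLL}(W) = \sum_{i=1}^{|W|} \log P\big(w_i \mid W_{\setminus i}\big),$$

    and pseudo-perplexity is $\exp(-\mathrm{PLL}(W)/|W|)$, which is exactly the quantity the snippet below computes, since the loss is the average negative log-likelihood over the masked positions.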

    As for the code, your snippet is perfectly correct, but for one detail: in recent implementations of Hugging Face BERT, masked_lm_labels has been renamed to simply labels, to make the interfaces of the various models more compatible. I have also replaced the hard-coded 103 with the generic tokenizer.mask_token_id. So the snippet below should work:

    from transformers import AutoModelForMaskedLM, AutoTokenizer
    import torch
    import numpy as np
    
    model_name = 'cointegrated/rubert-tiny'
    model = AutoModelForMaskedLM.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    
    def score(model, tokenizer, sentence):
        tensor_input = tokenizer.encode(sentence, return_tensors='pt')
        repeat_input = tensor_input.repeat(tensor_input.size(-1)-2, 1)
        mask = torch.ones(tensor_input.size(-1) - 1).diag(1)[:-2]
        masked_input = repeat_input.masked_fill(mask == 1, tokenizer.mask_token_id)
        labels = repeat_input.masked_fill(masked_input != tokenizer.mask_token_id, -100)
        with torch.inference_mode():
            loss = model(masked_input, labels=labels).loss
        return np.exp(loss.item())
    
    print(score(sentence='London is the capital of Great Britain.', model=model, tokenizer=tokenizer)) 
    # 4.541251105675365
    print(score(sentence='London is the capital of South America.', model=model, tokenizer=tokenizer)) 
    # 6.162017238332462
    

    You can try this code in Google Colab by running this gist.
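
    Since you have a whole dataset of sentences, you can simply apply the function in a loop; a minimal sketch (sentences here is a stand-in for your own data):

    sentences = [
        'London is the capital of Great Britain.',
        'London is the capital of South America.',
    ]
    scores = {s: score(model, tokenizer, s) for s in sentences}
    print(scores)

    Note that each call feeds all the masked copies of one sentence through the model as a single batch, so very long sentences may need to be split into smaller batches to fit in memory.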