I'm following this tutorial on getting predictions for masked words. I'm using this one because it seems to work with several masked words simultaneously, while the other approaches I tried could only handle one masked word at a time.
The code:
from transformers import RobertaTokenizer, RobertaForMaskedLM
import torch

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaForMaskedLM.from_pretrained('roberta-base')

sentence = "Tom has fully ___ ___ ___ illness."

def get_prediction(sent):
    token_ids = tokenizer.encode(sent, return_tensors='pt')
    # Positions of all <mask> tokens in the input
    masked_position = (token_ids.squeeze() == tokenizer.mask_token_id).nonzero()
    masked_pos = [mask.item() for mask in masked_position]

    with torch.no_grad():
        output = model(token_ids)

    last_hidden_state = output[0].squeeze()

    list_of_list = []
    for index, mask_index in enumerate(masked_pos):
        # Logits over the vocabulary for this mask position
        mask_hidden_state = last_hidden_state[mask_index]
        # Indices of the 5 highest-scoring tokens
        idx = torch.topk(mask_hidden_state, k=5, dim=0)[1]
        words = [tokenizer.decode(i.item()).strip() for i in idx]
        list_of_list.append(words)
        print("Mask ", index + 1, "Guesses : ", words)

    # Concatenate the top guess for each mask
    best_guess = ""
    for j in list_of_list:
        best_guess = best_guess + " " + j[0]

    return best_guess

print("Original Sentence: ", sentence)
sentence = sentence.replace("___", "<mask>")
print("Original Sentence replaced with mask: ", sentence)
print("\n")

predicted_blanks = get_prediction(sentence)
print("\nBest guess for fill in the blank :::", predicted_blanks)
How can I get the probability distribution over the top 5 tokens instead of just their indices? That is, similar to how this approach (which I used before, but which throws an error once I switch to multiple masked tokens) returns a score with each prediction:
from transformers import pipeline
# Initialize MLM pipeline
mlm = pipeline('fill-mask')
# Get mask token
mask = mlm.tokenizer.mask_token
# Get result for particular masked phrase
phrase = f'Read the rest of this {mask} to understand things in more detail'
result = mlm(phrase)
# Print result
print(result)
[{
    'sequence': 'Read the rest of this article to understand things in more detail',
    'score': 0.35419148206710815,
    'token': 1566,
    'token_str': ' article'
}, ...
The variable last_hidden_state[mask_index] holds the logits for the prediction of the masked token, so to get token probabilities you can apply a softmax over it, i.e.

probs = torch.nn.functional.softmax(last_hidden_state[mask_index], dim=0)

(dim=0 is given explicitly because the logits here are a 1-D vector over the vocabulary; it also avoids the implicit-dimension deprecation warning.)
You can then get the probabilities of the top-k tokens using

word_probs = [probs[i] for i in idx]
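
Putting those two lines into your loop, a minimal sketch of the modified loop inside get_prediction (the extra print of word_probs is what produces the probability lists shown below):

for index, mask_index in enumerate(masked_pos):
    mask_hidden_state = last_hidden_state[mask_index]
    # Softmax turns the raw logits into a probability distribution over the vocabulary
    probs = torch.nn.functional.softmax(mask_hidden_state, dim=0)
    # Softmax is monotonic, so ranking by probabilities gives the same top 5 as ranking by logits
    idx = torch.topk(probs, k=5, dim=0)[1]
    words = [tokenizer.decode(i.item()).strip() for i in idx]
    word_probs = [probs[i] for i in idx]
    list_of_list.append(words)
    print("Mask ", index + 1, "Guesses : ", words)
    print(word_probs)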
PS I assume you're aware that you should use <mask> rather than ___, i.e. sent = "Tom has fully <mask> <mask> <mask> illness.". With that input I get the following:
Mask 1 Guesses : ['recovered', 'returned', 'cleared', 'recover', 'healed']
[tensor(0.9970), tensor(0.0007), tensor(0.0003), tensor(0.0003), tensor(0.0002)]
Mask 2 Guesses : ['from', 'his', 'with', 'to', 'the']
[tensor(0.5066), tensor(0.2048), tensor(0.0684), tensor(0.0513), tensor(0.0399)]
Mask 3 Guesses : ['his', 'the', 'mental', 'serious', 'this']
[tensor(0.5152), tensor(0.2371), tensor(0.0407), tensor(0.0257), tensor(0.0199)]