Tags: python, neural-network, nlp, huggingface-transformers

Identifying the word picked by hugging face pipeline fill-mask


I want to use hugging face's fill-mask pipeline to guess a masked token and then extract just the guessed token as a word. This code should do that:

!pip install -q transformers
from transformers import pipeline

model = pipeline('fill-mask')
outcome = model("Kubernetes is a container orchestration <mask>")[0]

#Prints: "Kubernetes is a container orchestration platform" 
print(outcome['sequence']) 

token = outcome['token'] 

#Prints: 1761
print(token)

#Prints: Ġplatform 
print(model.tokenizer.convert_ids_to_tokens(token))

But I am finding that it gives me back "Ġplatform" instead of "platform". Does anyone know why this happens, or what is going on here?


Solution

  • This is simply a peculiarity of the underlying model (see here to check that the default fill-mask model is distilroberta-base).
    Distilled models use the same tokenizer as their "teacher" models (in this case, RoBERTa). RoBERTa, in turn, uses a byte-level BPE tokenizer that never discards whitespace; see also this thread on OpenAI's GPT-2 model, which uses the same tokenization strategy (see here).

    Specifically, it is always the same Unicode character, \u0120 (rendered as Ġ), that encodes the space before a token and thus marks the start of a new word. Subword tokens that continue a word carry no such prefix.

    E.g., complication might be split into the two hypothetical subwords Ġcompli and cation.

    Therefore, you can simply strip the Ġ if it appears at the start of the token.
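    A minimal sketch of that cleanup (pure string handling, no model needed; the helper name clean_token is my own, not part of the transformers API):

    ```python
    def clean_token(token_text: str) -> str:
        """Strip the byte-level BPE space marker from a token string.

        RoBERTa/GPT-2 tokenizers map the leading space of a word to the
        character "\u0120" (Ġ), so "Ġplatform" stands for " platform".
        """
        return token_text.lstrip("\u0120")

    print(clean_token("Ġplatform"))  # -> platform
    print(clean_token("cation"))     # word-continuation subword, unchanged
    ```

    Alternatively, model.tokenizer.decode(token) should give you the detokenized string (" platform", with a real leading space), which you can then .strip().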