python huggingface-transformers bert-language-model huggingface-tokenizers

How to get the character or string that was labelled as the 'UNK' token in BERT?

After tokenizing a string, the tokenizer returns a list of tokens made up of word pieces and special tokens. If that list contains an 'UNK' token, how can I find out which word or character of the original string it replaced?

Solution

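The fast tokenizers return a BatchEncoding object with built-in word_ids and token_to_chars methods. word_ids maps each token to the index of the word it came from, and token_to_chars returns the character span that a given token covers in the original string: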

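from transformers import BertTokenizerFast

# Only the fast (Rust-backed) tokenizers keep the alignment with the input text.
t = BertTokenizerFast.from_pretrained('bert-base-uncased')

tokens = t('word embeddings are vectors 😀')
# Token ids; 100 is the id of [UNK] in this vocabulary.
print(tokens['input_ids'])
# Decoding cannot recover the original text behind [UNK].
print(t.decode(tokens['input_ids']))
# Index of the source word for each token (None for special tokens).
print(tokens.word_ids())
# Character span in the input string covered by token 8, the [UNK] token.
print(tokens.token_to_chars(8))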

Output:

[101, 2773, 7861, 8270, 4667, 2015, 2024, 19019, 100, 102]
[CLS] word embeddings are vectors [UNK] [SEP]
[None, 0, 1, 1, 1, 1, 2, 3, 4, None]
CharSpan(start=28, end=29)
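Slicing the original string with that span gives back the text hidden behind [UNK]. To avoid hard-coding the token index, here is a minimal sketch that scans for every [UNK] token. It assumes you request the offsets with return_offsets_mapping=True (supported by the fast tokenizers only). The names find_unknown_substrings and demo are made up for this illustration and are not part of transformers.

```python
def find_unknown_substrings(text, input_ids, offsets, unk_token_id):
    """Return (token_index, start, end, substring) for every unknown token.

    Hypothetical helper: `offsets` is a list of (start, end) character
    positions, one per token, as found in encoding['offset_mapping'].
    """
    found = []
    for index, (token_id, (start, end)) in enumerate(zip(input_ids, offsets)):
        if token_id == unk_token_id:
            # The offsets index into the original string, so a plain slice
            # recovers the text that was replaced by [UNK].
            found.append((index, start, end, text[start:end]))
    return found


def demo():
    # Imported here so the helper above can be used without transformers.
    from transformers import BertTokenizerFast

    tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
    text = 'word embeddings are vectors 😀'
    # return_offsets_mapping adds an 'offset_mapping' entry to the encoding.
    encoding = tokenizer(text, return_offsets_mapping=True)
    return find_unknown_substrings(
        text,
        encoding['input_ids'],
        encoding['offset_mapping'],
        tokenizer.unk_token_id,
    )
```

Calling demo() on the sentence above should report token 8 with the span 28 to 29, which is the emoji, in line with the CharSpan printed earlier.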