After tokenizing a string, the tokenizer returns a list of token ids made up of subword tokens and special tokens. How can I find out which word or character in the original string was mapped to the [UNK] token, if any?
The fast tokenizers return a BatchEncoding object that has built-in word_ids() and token_to_chars() methods, which map each token back to its word index and character span in the original string:
from transformers import BertTokenizerFast

t = BertTokenizerFast.from_pretrained('bert-base-uncased')
tokens = t('word embeddings are vectors 😀')

print(tokens['input_ids'])
print(t.decode(tokens['input_ids']))
print(tokens.word_ids())          # word index for each token (None for special tokens)
print(tokens.token_to_chars(8))   # character span of token 8, the [UNK]
Output:
[101, 2773, 7861, 8270, 4667, 2015, 2024, 19019, 100, 102]
[CLS] word embeddings are vectors [UNK] [SEP]
[None, 0, 1, 1, 1, 1, 2, 3, 4, None]
CharSpan(start=28, end=29)