Tags: python, huggingface-transformers, bert-language-model, huggingface-tokenizers

How to get the character or string that BERT has labelled as the 'UNK' token?


Tokenizing a string returns a list of tokens made up of words, word pieces and special tokens. If one of those tokens is 'UNK', how can I find out which word or character of the original string it stands for?


Solution

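  • The fast tokenizers return a BatchEncoding object with built-in word_ids and token_to_chars methods. word_ids maps each token to the index of the word it came from, and token_to_chars returns the character span of a given token in the original string: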

    from transformers import BertTokenizerFast
    
    t = BertTokenizerFast.from_pretrained('bert-base-uncased')
    
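    tokens = t('word embeddings are vectors 😀')
    print(tokens['input_ids'])            # token ids; 100 is [UNK]
    print(t.decode(tokens['input_ids']))  # decoded text, including special tokens
    print(tokens.word_ids())              # word index per token (None for special tokens)
    print(tokens.token_to_chars(8))       # character span of token 8, the [UNK]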
    

    Output:

    [101, 2773, 7861, 8270, 4667, 2015, 2024, 19019, 100, 102]
    [CLS] word embeddings are vectors [UNK] [SEP]
    [None, 0, 1, 1, 1, 1, 2, 3, 4, None]
    CharSpan(start=28, end=29)
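
To recover every unknown token without hard-coding the index 8, one option is to request character offsets and slice the original string. The sketch below is only an illustration: find_unknown_spans and demo are hypothetical names, not part of transformers. It assumes a fast tokenizer, since return_offsets_mapping=True is not supported by the slow ones.

```python
from typing import List, Sequence, Tuple


def find_unknown_spans(
    text: str,
    input_ids: Sequence[int],
    offsets: Sequence[Tuple[int, int]],
    unk_token_id: int,
) -> List[Tuple[int, str]]:
    """Return (token_index, original_substring) for every unknown token.

    Hypothetical helper: it only pairs token ids with their character
    offsets and slices the original text where the id is the unknown id.
    """
    found = []
    for index, (token_id, (start, end)) in enumerate(zip(input_ids, offsets)):
        if token_id == unk_token_id:
            found.append((index, text[start:end]))
    return found


def demo() -> None:
    # Imported lazily so the helper above can be used without transformers.
    from transformers import BertTokenizerFast

    t = BertTokenizerFast.from_pretrained('bert-base-uncased')
    text = 'word embeddings are vectors 😀'
    # return_offsets_mapping=True adds one (start, end) character span per token.
    enc = t(text, return_offsets_mapping=True)
    print(find_unknown_spans(text, enc['input_ids'], enc['offset_mapping'], t.unk_token_id))
```

Calling demo() on the sentence above should report token 8 together with the emoji, matching the CharSpan(start=28, end=29) shown in the output. Keeping the helper free of any tokenizer dependency means it can be reused with any encoding that exposes ids and offsets.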