pytorch · huggingface-transformers · huggingface-tokenizers

Get the index of subwords produced by BertTokenizer (in transformers library)


BertTokenizer can tokenize a sentence into a list of tokens, where a long word, e.g. "embeddings", is split into several subwords, i.e. 'em', '##bed', '##ding', and '##s'.
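
For example, calling tokenize() directly shows the split (a quick check using the same 'bert-base-uncased' checkpoint as below):

from transformers import BertTokenizer

t = BertTokenizer.from_pretrained('bert-base-uncased')

# 'word' is in the vocabulary; 'embeddings' is broken into WordPiece subwords
print(t.tokenize('word embeddings'))
# ['word', 'em', '##bed', '##ding', '##s']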

Is there a way to locate the subwords? For example,

from transformers import BertTokenizer

t = BertTokenizer.from_pretrained('bert-base-uncased')

tokens = t('word embeddings', add_special_tokens=False)
location = locate_subwords(tokens)  # desired helper, does not exist yet

I want the location to be like [0, 1, 1, 1, 1], corresponding to ['word', 'em', '##bed', '##ding', '##s'], where 0 means a normal word and 1 means a subword.


Solution

  • The fast tokenizers return a BatchEncoding object that has a built-in word_ids() method:

    from transformers import BertTokenizerFast
    
    t = BertTokenizerFast.from_pretrained('bert-base-uncased')
    
    tokens = t('word embeddings are vectors', add_special_tokens=False, return_attention_mask=False, return_token_type_ids=False)
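    # word_ids() maps each token to the index of the word it came from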
    print(tokens.word_ids())
    

    Output:

    [0, 1, 1, 1, 1, 2, 3]
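
  • If you want exactly the 0/1 flags from the question, you can derive them from word_ids(): any word that maps to more than one token was split, so every one of its pieces gets a 1. Below is a minimal sketch of such a locate_subwords helper (the name is taken from the question; this is one possible implementation, not a built-in):

    from collections import Counter
    from transformers import BertTokenizerFast

    t = BertTokenizerFast.from_pretrained('bert-base-uncased')

    def locate_subwords(encoding):
        # word_ids() maps each token to the index of its source word
        ids = encoding.word_ids()
        # a word that produced more than one token was split into subwords
        counts = Counter(ids)
        return [1 if counts[i] > 1 else 0 for i in ids]

    enc = t('word embeddings', add_special_tokens=False)
    print(enc.tokens())          # ['word', 'em', '##bed', '##ding', '##s']
    print(locate_subwords(enc))  # [0, 1, 1, 1, 1]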