
Tokens returned in transformers Bert model from encode()


I have a small dataset for sentiment analysis. The classifier will be a simple KNN, but I wanted to get the word embeddings with the BERT model from the transformers library. Note that I only just found out about this library - I am still learning.

So, looking at online examples, I am trying to understand the dimensions that are returned from the model.

Example:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

tokens = tokenizer.encode(["Hello, my dog is cute", "He is really nice"])
print(tokens)

tokens = tokenizer.encode("Hello, my dog is cute", "He is really nice")
print(tokens)

tokens = tokenizer.encode(["Hello, my dog is cute"])
print(tokens)

tokens = tokenizer.encode("Hello, my dog is cute")
print(tokens)

The output is the following:

[101, 100, 100, 102]

[101, 7592, 1010, 2026, 3899, 2003, 10140, 102, 2002, 2003, 2428, 3835, 102]

[101, 100, 102]

[101, 7592, 1010, 2026, 3899, 2003, 10140, 102]

I can't seem to find the docs for encode() - I have no idea why it returns different results when the input is passed as a list. What is it doing?

Additionally, is there a method that takes a token id and gives back the actual word - to troubleshoot the above?

Thank you in advance


Solution

  • You can call tokenizer.convert_ids_to_tokens() to get the actual token for an id:

    from transformers import BertTokenizer
    
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    
    tokens = []
    
    tokens.append(tokenizer.encode(["Hello, my dog is cute", "He is really nice"]))
    
    tokens.append(tokenizer.encode("Hello, my dog is cute", "He is really nice"))
    
    tokens.append(tokenizer.encode(["Hello, my dog is cute"]))
    
    tokens.append(tokenizer.encode("Hello, my dog is cute"))
    
    for t in tokens:
        print(tokenizer.convert_ids_to_tokens(t))
    

    Output:

    ['[CLS]', '[UNK]', '[UNK]', '[SEP]']
    ['[CLS]', 'hello', ',', 'my', 'dog', 'is', 'cute', '[SEP]', 'he', 'is', 'really', 'nice', '[SEP]']
    ['[CLS]', '[UNK]', '[SEP]']
    ['[CLS]', 'hello', ',', 'my', 'dog', 'is', 'cute', '[SEP]']
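
    The ids you saw earlier map to BERT's reserved tokens: 101 is [CLS], 102 is [SEP] and 100 is [UNK]. You can confirm this on the tokenizer itself, and decode() turns an id list straight back into a readable string (a quick sanity check with the same tokenizer as above):

    print(tokenizer.cls_token_id, tokenizer.sep_token_id, tokenizer.unk_token_id)
    # 101 102 100

    print(tokenizer.decode([101, 7592, 1010, 2026, 3899, 2003, 10140, 102]))
    # [CLS] hello, my dog is cute [SEP]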
    

    As you can see, each input was tokenized and the special tokens were added according to your model (BERT): [CLS] at the start and [SEP] at the end of each sequence. Passing two strings as separate arguments encodes them as a sentence pair, which is why the second call returns one sequence with a second [SEP] in the middle. The list inputs, however, were not treated as batches: encode interprets a list of strings as text that is already tokenized, so each whole sentence is looked up in the vocabulary as a single word and, not being in it, becomes [UNK]. Whether that is a bug or intended behaviour is debatable, because there is a dedicated method for batch processing, batch_encode_plus:

    tokenizer.batch_encode_plus(["Hello, my dog is cute", "He is really nice"], return_token_type_ids=False, return_attention_mask=False)
    

    Output:

    {'input_ids': [[101, 7592, 1010, 2026, 3899, 2003, 10140, 102], [101, 2002, 2003, 2428, 3835, 102]]}
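
    If you plan to feed a batch like this to the model, the sequences also need to be padded to the same length. In recent transformers versions that is the padding=True argument (a sketch; older releases used pad_to_max_length=True instead):

    tokenizer.batch_encode_plus(
        ["Hello, my dog is cute", "He is really nice"],
        padding=True,                  # pad the shorter sequence with [PAD] (id 0)
        return_token_type_ids=False,
    )

    The padded positions are then ignored by the model through the attention_mask that the tokenizer returns by default.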
    

    I'm not sure why the encode method is not documented, but it could be the case that huggingface wants us to use the __call__ method directly, i.e. call the tokenizer itself:

    tokens = []
    
    tokens.append(tokenizer(["Hello, my dog is cute", "He is really nice"],  return_token_type_ids=False, return_attention_mask=False))
    
    tokens.append(tokenizer("Hello, my dog is cute", "He is really nice",  return_token_type_ids=False, return_attention_mask=False))
    
    tokens.append(tokenizer(["Hello, my dog is cute"], return_token_type_ids=False, return_attention_mask=False))
    
    tokens.append(tokenizer("Hello, my dog is cute", return_token_type_ids=False, return_attention_mask=False))
    
    print(tokens)
    

    Output:

    [{'input_ids': [[101, 7592, 1010, 2026, 3899, 2003, 10140, 102], [101, 2002, 2003, 2428, 3835, 102]]}, {'input_ids': [101, 7592, 1010, 2026, 3899, 2003, 10140, 102, 2002, 2003, 2428, 3835, 102]}, {'input_ids': [[101, 7592, 1010, 2026, 3899, 2003, 10140, 102]]}, {'input_ids': [101, 7592, 1010, 2026, 3899, 2003, 10140, 102]}]
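
    Finally, since the end goal is embeddings for a KNN classifier, here is a minimal sketch of getting sentence vectors out of BertModel (assuming PyTorch and a recent transformers version; mean pooling over the token embeddings is one common choice, not the only one):

    import torch
    from transformers import BertTokenizer, BertModel

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertModel.from_pretrained('bert-base-uncased')
    model.eval()

    # tokenize the batch with padding and get PyTorch tensors back
    enc = tokenizer(["Hello, my dog is cute", "He is really nice"],
                    padding=True, return_tensors='pt')

    with torch.no_grad():
        out = model(**enc)

    # out.last_hidden_state has shape [batch, seq_len, 768];
    # mask out the padded positions before averaging per sentence
    mask = enc['attention_mask'].unsqueeze(-1)
    embeddings = (out.last_hidden_state * mask).sum(1) / mask.sum(1)
    print(embeddings.shape)  # torch.Size([2, 768])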