
Tokens returned in transformers Bert model from encode()


I have a small dataset for sentiment analysis. The classifier will be a simple KNN, but I wanted to get the word embeddings with the BERT model from the transformers library. Note that I only just found out about this library - I am still learning.

So, looking at online examples, I am trying to understand the dimensions that are returned from the model.

Example:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

tokens = tokenizer.encode(["Hello, my dog is cute", "He is really nice"])
print(tokens)

tokens = tokenizer.encode("Hello, my dog is cute", "He is really nice")
print(tokens)

tokens = tokenizer.encode(["Hello, my dog is cute"])
print(tokens)

tokens = tokenizer.encode("Hello, my dog is cute")
print(tokens)

The output is the following:

[101, 100, 100, 102]

[101, 7592, 1010, 2026, 3899, 2003, 10140, 102, 2002, 2003, 2428, 3835, 102]

[101, 100, 102]

[101, 7592, 1010, 2026, 3899, 2003, 10140, 102]

I can't seem to find the docs for encode() - I have no idea why it returns different results when the input is passed as a list. What is it doing?

Additionally, is there a method that takes a token id and gives back the actual word - to troubleshoot the above?

Thank you in advance


Solution

  • You can call tokenizer.convert_ids_to_tokens() to get the actual token for an id:

    from transformers import BertTokenizer
    
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    
    tokens = []
    
    tokens.append(tokenizer.encode(["Hello, my dog is cute", "He is really nice"]))
    
    tokens.append(tokenizer.encode("Hello, my dog is cute", "He is really nice"))
    
    tokens.append(tokenizer.encode(["Hello, my dog is cute"]))
    
    tokens.append(tokenizer.encode("Hello, my dog is cute"))
    
    for t in tokens:
        print(tokenizer.convert_ids_to_tokens(t))
    

    Output:

    ['[CLS]', '[UNK]', '[UNK]', '[SEP]']
    ['[CLS]', 'hello', ',', 'my', 'dog', 'is', 'cute', '[SEP]', 'he', 'is', 'really', 'nice', '[SEP]']
    ['[CLS]', '[UNK]', '[SEP]']
    ['[CLS]', 'hello', ',', 'my', 'dog', 'is', 'cute', '[SEP]']
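
    The ids you saw earlier map to BERT's reserved tokens: 101 is [CLS], 102 is [SEP] and 100 is [UNK]. You can confirm this on the tokenizer itself, and decode() turns an id list straight back into a readable string (a quick sanity check with the same tokenizer as above):

    print(tokenizer.cls_token_id, tokenizer.sep_token_id, tokenizer.unk_token_id)
    # 101 102 100

    print(tokenizer.decode([101, 7592, 1010, 2026, 3899, 2003, 10140, 102]))
    # [CLS] hello, my dog is cute [SEP]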
    

    As you can see, each input was tokenized and the special tokens were added according to your model (BERT): [CLS] at the start and [SEP] at the end of each sequence. Passing two strings as separate arguments encodes them as a sentence pair, which is why the second call returns one sequence with a second [SEP] in the middle. The list inputs, however, were not treated as batches: encode interprets a list of strings as text that is already tokenized, so each whole sentence is looked up in the vocabulary as a single word and, not being in it, becomes [UNK]. Whether that is a bug or intended behaviour is debatable, because there is a dedicated method for batch processing, batch_encode_plus:

    tokenizer.batch_encode_plus(["Hello, my dog is cute", "He is really nice"], return_token_type_ids=False, return_attention_mask=False)
    

    Output:

    {'input_ids': [[101, 7592, 1010, 2026, 3899, 2003, 10140, 102], [101, 2002, 2003, 2428, 3835, 102]]}
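
    If you plan to feed a batch like this to the model, the sequences also need to be padded to the same length. In recent transformers versions that is the padding=True argument (a sketch; older releases used pad_to_max_length=True instead):

    tokenizer.batch_encode_plus(
        ["Hello, my dog is cute", "He is really nice"],
        padding=True,                  # pad the shorter sequence with [PAD] (id 0)
        return_token_type_ids=False,
    )

    The padded positions are then ignored by the model through the attention_mask that the tokenizer returns by default.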
    

    I'm not sure why the encode method is not documented, but it could be the case that huggingface wants us to use the __call__ method directly, i.e. call the tokenizer itself:

    tokens = []
    
    tokens.append(tokenizer(["Hello, my dog is cute", "He is really nice"],  return_token_type_ids=False, return_attention_mask=False))
    
    tokens.append(tokenizer("Hello, my dog is cute", "He is really nice",  return_token_type_ids=False, return_attention_mask=False))
    
    tokens.append(tokenizer(["Hello, my dog is cute"], return_token_type_ids=False, return_attention_mask=False))
    
    tokens.append(tokenizer("Hello, my dog is cute", return_token_type_ids=False, return_attention_mask=False))
    
    print(tokens)
    

    Output:

    [{'input_ids': [[101, 7592, 1010, 2026, 3899, 2003, 10140, 102], [101, 2002, 2003, 2428, 3835, 102]]}, {'input_ids': [101, 7592, 1010, 2026, 3899, 2003, 10140, 102, 2002, 2003, 2428, 3835, 102]}, {'input_ids': [[101, 7592, 1010, 2026, 3899, 2003, 10140, 102]]}, {'input_ids': [101, 7592, 1010, 2026, 3899, 2003, 10140, 102]}]
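
    Finally, since the end goal is embeddings for a KNN classifier, here is a minimal sketch of getting sentence vectors out of BertModel (assuming PyTorch and a recent transformers version; mean pooling over the token embeddings is one common choice, not the only one):

    import torch
    from transformers import BertTokenizer, BertModel

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertModel.from_pretrained('bert-base-uncased')
    model.eval()

    # tokenize the batch with padding and get PyTorch tensors back
    enc = tokenizer(["Hello, my dog is cute", "He is really nice"],
                    padding=True, return_tensors='pt')

    with torch.no_grad():
        out = model(**enc)

    # out.last_hidden_state has shape [batch, seq_len, 768];
    # mask out the padded positions before averaging per sentence
    mask = enc['attention_mask'].unsqueeze(-1)
    embeddings = (out.last_hidden_state * mask).sum(1) / mask.sum(1)
    print(embeddings.shape)  # torch.Size([2, 768])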