I have a small dataset for sentiment analysis. The classifier will be a simple KNN, but I wanted to get the word embeddings with the BERT model from the transformers library. Note that I just found out about this library - I am still learning.
Looking at online examples, I am trying to understand the dimensions that are returned from the model.
Example:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokens = tokenizer.encode(["Hello, my dog is cute", "He is really nice"])
print(tokens)
tokens = tokenizer.encode("Hello, my dog is cute", "He is really nice")
print(tokens)
tokens = tokenizer.encode(["Hello, my dog is cute"])
print(tokens)
tokens = tokenizer.encode("Hello, my dog is cute")
print(tokens)
The output is the following:
[101, 100, 100, 102]
[101, 7592, 1010, 2026, 3899, 2003, 10140, 102, 2002, 2003, 2428, 3835, 102]
[101, 100, 102]
[101, 7592, 1010, 2026, 3899, 2003, 10140, 102]
I can't seem to find the docs for encode() - I have no idea why it returns something different when the input is passed as a list. What is it doing?
Additionally, is there a method to pass a token id and get the actual word back, so I can troubleshoot the above?
Thank you in advance.
You can call tokenizer.convert_ids_to_tokens() to get the actual token for an id:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokens = []
tokens.append(tokenizer.encode(["Hello, my dog is cute", "He is really nice"]))
tokens.append(tokenizer.encode("Hello, my dog is cute", "He is really nice"))
tokens.append(tokenizer.encode(["Hello, my dog is cute"]))
tokens.append(tokenizer.encode("Hello, my dog is cute"))
for t in tokens:
    print(tokenizer.convert_ids_to_tokens(t))
Output:
['[CLS]', '[UNK]', '[UNK]', '[SEP]']
['[CLS]', 'hello', ',', 'my', 'dog', 'is', 'cute', '[SEP]', 'he', 'is', 'really', 'nice', '[SEP]']
['[CLS]', '[UNK]', '[SEP]']
['[CLS]', 'hello', ',', 'my', 'dog', 'is', 'cute', '[SEP]']
As you can see here, each of your inputs was tokenized and special tokens were added according to your model (BERT). The [UNK] outputs come from how encode() interprets a list of strings: it treats the list as an already-tokenized sequence, so each list element is looked up in the vocabulary as a single token, and since "Hello, my dog is cute" is not a vocabulary entry it maps to the unknown token. For encoding several sentences at once there is a dedicated method, batch_encode_plus:
tokenizer.batch_encode_plus(["Hello, my dog is cute", "He is really nice"], return_token_type_ids=False, return_attention_mask=False)
Output:
{'input_ids': [[101, 7592, 1010, 2026, 3899, 2003, 10140, 102], [101, 2002, 2003, 2428, 3835, 102]]}
I'm not sure why the encode method is not documented, but it could be the case that Hugging Face wants us to call the tokenizer directly (i.e. use its __call__ method):
tokens = []
tokens.append(tokenizer(["Hello, my dog is cute", "He is really nice"], return_token_type_ids=False, return_attention_mask=False))
tokens.append(tokenizer("Hello, my dog is cute", "He is really nice", return_token_type_ids=False, return_attention_mask=False))
tokens.append(tokenizer(["Hello, my dog is cute"], return_token_type_ids=False, return_attention_mask=False))
tokens.append(tokenizer("Hello, my dog is cute", return_token_type_ids=False, return_attention_mask=False))
print(tokens)
Output:
[{'input_ids': [[101, 7592, 1010, 2026, 3899, 2003, 10140, 102], [101, 2002, 2003, 2428, 3835, 102]]}, {'input_ids': [101, 7592, 1010, 2026, 3899, 2003, 10140, 102, 2002, 2003, 2428, 3835, 102]}, {'input_ids': [[101, 7592, 1010, 2026, 3899, 2003, 10140, 102]]}, {'input_ids': [101, 7592, 1010, 2026, 3899, 2003, 10140, 102]}]