Given the following,
from transformers import TFAutoModel
from transformers import BertTokenizer
bert = TFAutoModel.from_pretrained('bert-base-cased')
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
I expected that if special tokens are added, the output vectors for the remaining tokens would stay the same, and yet they do not. For example, I expected the following two values to be equal, but every token's output changes. Why is this?
tokens = tokenizer(['this product is no good'], add_special_tokens=True, return_tensors='tf')
output = bert(tokens)
output[0][0][1]  # hidden state of 'this', at position 1 because [CLS] is prepended
tokens = tokenizer(['this product is no good'], add_special_tokens=False, return_tensors='tf')
output = bert(tokens)
output[0][0][0]  # hidden state of 'this', at position 0 with no special tokens
When setting add_special_tokens=True, you are including the [CLS] token at the front and the [SEP] token at the end of your sentence, which leads to a total of 7 tokens instead of 5:
import tensorflow as tf

tokens = tokenizer(['this product is no good'], add_special_tokens=True, return_tensors='tf')
print(tokenizer.convert_ids_to_tokens(tf.squeeze(tokens['input_ids'], axis=0)))
['[CLS]', 'this', 'product', 'is', 'no', 'good', '[SEP]']
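For comparison, with add_special_tokens=False the same sentence tokenizes to just the 5 word pieces; the word tokens and their IDs are identical in both cases, so it is only their contextual output vectors that differ:
tokens = tokenizer(['this product is no good'], add_special_tokens=False, return_tensors='tf')
print(tokenizer.convert_ids_to_tokens(tf.squeeze(tokens['input_ids'], axis=0)))
['this', 'product', 'is', 'no', 'good']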
Your sentence-level embeddings are different because these two special tokens become part of the input sequence and are propagated through the BERT model together with the word tokens. They are not masked out like padding tokens ([PAD]). Check out the docs for more information.
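The attention_mask makes this explicit. As a quick sketch (padding to an arbitrary max_length of 9 purely for illustration), the special tokens are marked 1, i.e. attended to like any word token, while the [PAD] positions are marked 0 and masked out:
tokens = tokenizer(['this product is no good'],
                   add_special_tokens=True,
                   padding='max_length',
                   max_length=9,
                   return_tensors='tf')
print(tokenizer.convert_ids_to_tokens(tf.squeeze(tokens['input_ids'], axis=0)))
# ['[CLS]', 'this', 'product', 'is', 'no', 'good', '[SEP]', '[PAD]', '[PAD]']
print(tokens['attention_mask'])  # 1 for [CLS], the words and [SEP]; 0 for the [PAD] positions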
If you take a closer look at how BERT's Transformer-encoder architecture and its attention mechanism work, you will quickly understand why a single difference between two input sequences generates different hidden_states. New tokens are not simply concatenated to the existing ones; in a sense, all the tokens depend on each other. According to BERT author Jacob Devlin:
I'm not sure what these vectors are, since BERT does not generate meaningful sentence vectors. It seems that this is doing average pooling over the word tokens to get a sentence vector, but we never suggested that this will generate meaningful sentence representations.
Or, from another interesting discussion:
[...] The value of CLS is influenced by other tokens, just like other tokens are influenced by their context (attention).
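To see this concretely, here is a minimal sketch (reusing the bert and tokenizer objects from above) that compares the hidden state of 'this' with and without the special tokens; the token ID is identical in both encodings, but the output vector is not, because [CLS] and [SEP] take part in attention at every layer:
import tensorflow as tf

sent = ['this product is no good']

with_special = tokenizer(sent, add_special_tokens=True, return_tensors='tf')
without_special = tokenizer(sent, add_special_tokens=False, return_tensors='tf')

# 'this' sits at position 1 with special tokens (after [CLS]) and at position 0 without them
vec_with = bert(with_special)[0][0][1]
vec_without = bert(without_special)[0][0][0]

# Same token ID in both encodings ...
print(with_special['input_ids'][0][1] == without_special['input_ids'][0][0])  # tf.Tensor(True, ...)

# ... but different contextual embeddings, so the cosine similarity is not 1
cos = tf.reduce_sum(vec_with * vec_without) / (tf.norm(vec_with) * tf.norm(vec_without))
print(cos)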